
# ANOVA, Regression and Multiple Regression March 22-23



Why ANOVA

• ANOVA allows us to compare the means for groups and ask if they are sufficiently different from one another to say that such differences are statistically significant.

• In other words, there is a low probability that a difference of such magnitude would occur between groups in the real world if the actual difference between groups were zero.

• Unlike “t” tests, ANOVA can be performed with more than two groups.

• The statistic associated with ANOVA is the “F” statistic, and the test is the “F” test.

As the book notes…

• “The details of ANOVA are a bit daunting (they appear in an optional section at the end of this chapter). The main idea of ANOVA is more accessible and much more important. Here it is: when we ask if a set of sample means gives evidence for differences among the population means, what matters is not how far apart the sample means are but how far apart they are relative to the variability of individual observations.”

In other words, let's look at both the means and the overlap of the distributions!

Statistically speaking

• The “F” statistic compares the variation among the sample means with the variation among individuals in the same sample.

• Like “t”, the “F” statistic is very robust, so you should not worry too much about deviations from normality if your sample is large.

$$F = \frac{\text{variation among the sample means}}{\text{variation among individuals in the same sample}}$$
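A minimal sketch of this ratio in Python, using only the standard library. The data are made up for illustration, and `one_way_f` is a hypothetical helper name, not something taken from the book or from SPSS.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)                           # number of groups
    n_total = sum(len(g) for g in groups)     # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Numerator: variation among the sample means (mean square between groups)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ms_between = ss_between / (k - 1)

    # Denominator: variation among individuals in the same sample (mean square within)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_within = ss_within / (n_total - k)

    return ms_between / ms_within

# Made-up data: three small groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_f(groups)
print(f_stat)
```

A large F means the group means are far apart relative to the spread inside each group. Turning F into a p-value needs the F distribution, which statistical software supplies.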

One warning

• ANOVA assumes that the variability of observations, measured by the standard deviation, is the same in all populations.

• In the real world, if you keep the sizes of the groups you are comparing roughly similar, few problems occur, but you must check.

The book gives this rule

• Results of the “F” test are usually okay if the largest sample standard deviation is no more than twice as large as the smallest sample standard deviation.

• Another way to check is “Levene’s equality of variance” statistic.

• If it is significant (low p), the data give evidence that the variability is not the same in all the groups being compared, and you have a problem.
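The book's rule of thumb is easy to check by hand. The sketch below uses made-up data and a hypothetical helper name, `sd_ratio`; it covers only the rule of thumb, not Levene's test, which statistical software provides.

```python
from statistics import stdev

def sd_ratio(groups):
    """Ratio of the largest to the smallest sample standard deviation."""
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds)

# Made-up data: three groups with visibly different spreads
groups = [[4, 6, 8], [1, 5, 9], [3, 4, 5]]
ratio = sd_ratio(groups)
rule_ok = ratio <= 2   # rule of thumb: largest sd no more than twice the smallest

print(ratio, rule_ok)
```

Here the ratio exceeds 2, so the equal-variability assumption would be in doubt for these illustrative groups.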

From Table 24.2 (richness of trees by Group)

Why Regression

• Regression is commonly used in the social sciences because it allows us to explain and to predict, which are two of the big goals of social science (along with describing).

Recall

• Regression involves mathematically describing a linear relationship between a response (or dependent) variable and an explanatory (or independent) variable

• That line is given in the form y = a + b(x), where:
– y is the response variable
– a is the y-axis intercept of the line
– b is the slope of the line
– x is the explanatory variable

Requirements for use of regression

• Also recall that if the relationship between our response and explanatory variable is not linear, then regression will give misleading results. Therefore we always do a scatter plot before attempting regression. The mathematical notation for linearity is below.

• This is sometimes called the “least-squares regression line” because the procedure finds the line that makes the sum of the squared vertical distances from the data points to the line as small as possible.

$$y = a + bx$$
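As a rough illustration of how the least-squares line is found, the sketch below computes the intercept a and slope b from the usual closed-form formulas. The data are made up and `least_squares` is a hypothetical helper name; your software performs the same calculation for you.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(a, b)
```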

Regression requirements continued

• For any value of x the values of y are normally distributed and repeated responses of y are independent of each other

• The standard deviation of y is the same for all values of x

Regression Analysis

• As well as estimating the regression line, we also estimate the goodness of fit between the line and the data by using a statistic known as Rsq.

• Rsq (as the name implies) is the square of the correlation measure known as “r”.

• We also have to know the significance of the association between the explanatory and response variables (as well as the coefficient “a”) for the line we have found. We use a variation of the “t” test for this.

A useful tool: Regression Standard Error

• The regression standard error is a useful tool that can help us diagnose whether we have met the various conditions needed to perform a regression (don’t worry, your software will do this).

$$s = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2}$$
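A short sketch of the regression standard error, assuming made-up data and a line fitted by least squares. The variable names are illustrative only.

```python
# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Fit the least-squares line y = a + b*x
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

# Residuals: observed y minus predicted y-hat
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Regression standard error: divide by n - 2 because two
# quantities (a and b) were estimated from the data
s = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5
print(s)
```

The residuals computed here are the same quantities that are later plotted to check the regression conditions.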

So looking at example 23.1 in your book here is the scatter plot

Here is the regression control showing how I have selected the standard errors, called “residuals” in SPSS

Here is the dialogue box in Excel using the plug-in

Here is a portion of the printout that was generated in SPSS

Here you can see the standard residuals or errors that were calculated

In Excel it looks like this

A happy coincidence

• As the book notes, Rsq is “closely related” to r.

• In fact, it is literally the square of r in a simple OLS regression with one explanatory variable.

• Therefore, when you test the null hypothesis that the regression line is actually flat, you have also pretty much tested the correlation. However, most software also prints it out in case you want to see it.
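The relationship between Rsq and r can be checked numerically. The sketch below uses made-up data, computes r directly, and separately computes Rsq as the share of variation in y explained by the fitted line. In a one-variable regression the two should agree.

```python
# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Correlation coefficient r
r = sxy / (sxx * syy) ** 0.5

# Least-squares line and its unexplained (residual) variation
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Rsq: proportion of the variation in y explained by the line
r_squared = 1 - sse / syy

print(r ** 2, r_squared)
```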

Does it matter that the estimate of the intercept is not significant?

• In practice, no.

• What really matters is the estimate of the slope.

Calculating the confidence interval for your line

• If you look back at our printout, you will see that the slope is given, as is the standard error of the slope. Put them together with t*, the critical value from the t distribution with n − 2 degrees of freedom, and you have the 95% confidence interval for the population slope.

$$b \pm t^{*}\, SE_b \quad (95\%\ \text{confidence})$$
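A sketch of the interval calculation under stated assumptions: the data are made up, and the critical value t* = 3.182 (95% confidence, n − 2 = 3 degrees of freedom) is taken from a standard t table rather than computed. With a different sample size you would look up a different t*.

```python
# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Assumed critical value from a t table: 95% confidence, df = n - 2 = 3
T_STAR_DF3 = 3.182

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

# Regression standard error s, then the standard error of the slope
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = (sse / (n - 2)) ** 0.5
se_b = s / sxx ** 0.5

# 95% confidence interval for the population slope: b +/- t* x SE_b
margin = T_STAR_DF3 * se_b
lower, upper = b - margin, b + margin
print(lower, upper)
```

With so few made-up cases the interval is wide and includes zero, which is the same conclusion a non-significant t test on the slope would give.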

Or you could have had the computer calculate the confidence intervals for you

As noted before we can use our standardized errors to check our assumptions

• The y values vary normally for each x value: do a histogram of your residuals and check for relative normality of distribution

• Plot the residuals as the dependent variable with the x variable as independent to check for linearity and that the observations for Y are independent of each other

• That the standard deviation of the responses is the same for all x can be checked by looking for a roughly symmetrical spread of residuals above and below the zero point

Our previous example had too few cases to check residuals, so here is Example 23.9 from the book on climate change

Moving from OLS to Multiple OLS: three big changes

• 1. We have to use Beta instead of B.

• 2. We have to be aware of multicollinearity and other multiple impacts (in short, that we are not just piling on independent variables but that each independent variable is demonstrating a unique explanatory power).

• The book gives you a third: we have to be aware of interaction terms and other factors that lead us to pick one model over another.

The equation is now changed to reflect the greater number of variables and the change from b to beta

$$y = a + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$
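To make the multiple-regression equation concrete, here is a minimal sketch that fits an intercept and two slopes by solving the least-squares normal equations in plain Python. Everything in it is an assumption for illustration: the data are made up (constructed so that y = 1 + 2·x1 + 3·x2 exactly), and `solve` and `fit_multiple` are hypothetical helper names. It returns unstandardized coefficients; the standardized Beta that SPSS prints is a rescaled version of these.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [v] for row, v in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

def fit_multiple(columns, ys):
    """Least-squares fit of y on several x variables; returns [intercept, slope_1, ...]."""
    # Each row of the design matrix is 1 (for the intercept) followed by the x values
    rows = [[1.0] + [float(v) for v in vals] for vals in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

# Made-up data built from y = 1 + 2*x1 + 3*x2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
ys = [9, 8, 19, 18, 29]
coefs = fit_multiple([x1, x2], ys)
print(coefs)
```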

How to do it?

• Start from the beginning and look at each variable separately using our descriptive and exploratory techniques

• Now look at our dependent variable in pairs with each independent variable using correlations to see which ones might have a big impact

• Fit different models. Pay attention to changes in explanatory power and also the t statistics

• If using stats software, use stepwise procedures. Stepwise adds and removes variables in the order you input them based on a selection criterion (change in the F statistic from the ANOVA test). In short, the computer tells you which model best fits with as few variables as possible.

In SPSS you can do more than one scatterplot at once

The data provided in Table 27.6 represent a random sample of 60 customers from a large clothing retailer. The manager of the store is interested in predicting how much a customer will spend on his or her next purchase. Our goal is to find a regression model for predicting the amount of a purchase from the available explanatory variables. A short description of each variable is provided below.

Here are the print outs for Ex 27.19 using SPSS

• Purchase 12 shows the total purchases each customer makes over the last 12 months divided by the frequency of their visits to the store

As you will see, it changes things. Here is the OLS for it alone

• The last slide was basically an interaction of the two variables we previously identified as helpful. Let’s go back to when they were separate for a second and test whether each has a separate impact or if multicollinearity is at play. Look for tolerances of 0.1 or less as evidence of multicollinearity.
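Tolerance for an explanatory variable is 1 minus the Rsq obtained when that variable is regressed on the other explanatory variables. With only two explanatory variables this reduces to 1 minus their squared correlation, which the sketch below computes. The data are made up and do not come from Table 27.6, and `tolerance_two_predictors` is a hypothetical helper name.

```python
def tolerance_two_predictors(x1, x2):
    """Tolerance shared by two explanatory variables: 1 - r^2 between them."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    r_squared = s12 ** 2 / (s11 * s22)
    return 1 - r_squared

# Made-up explanatory variables
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
tol = tolerance_two_predictors(x1, x2)
flagged = tol <= 0.1   # the slide's cutoff for evidence of multicollinearity

print(tol, flagged)
```

A tolerance near 1 means the variable carries information the other does not; a tolerance near 0 means the two are nearly redundant.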

Finally let’s look at our residual plots

• Often you might have the chance to use more elaborate residuals than standardized ones, such as studentized residuals.

• As there is no pattern we assume the variance for y is the same for all values of x

• The sequence chart also tells us that the y values are independent of each other

• The QQ plot tells us the residuals are roughly normal, meaning that the assumption that values of y vary normally for each value of x might be met