
ANOVA and R-squared revisited. Multiple regression and r-squared. (…jackd/Stat302/Wk07-1_Full.pdf)



Page 1:

Agenda for Week 7, Hour 1

ANOVA and R-squared revisited.

Multiple regression and r-squared.

Week 7, Hour 2

Multiple regression: co-linearity, perturbations,

correlation matrix

Stat 302 Notes. Week 7, Hour 1, Page 1 / 28

Page 2:

Consider this made-up dataset on silicon wafers, wafers.csv. It’s based on a very common type of quality control analysis in manufacturing.

A factory manager is interested in reducing the number of bad wafers the factory produces in a batch.

She sets the factory to make 6 batches of wafers each at 3 levels of cooking temperature and 3 levels of spin speed. There are 54 batches of wafers in total. The response variable is number of bad wafers (in a batch of 1000).
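As a sketch, the layout described above can be generated in R (the actual temperature and spin values used here are made up for illustration; the slide doesn't give them):

```r
# Hypothetical 3x3 factorial layout with 6 batches per combination.
# The cooktemp and spinrpm values are placeholders, not from wafers.csv.
design <- expand.grid(cooktemp = c(300, 350, 400),    # 3 temperature levels
                      spinrpm  = c(1000, 1500, 2000), # 3 spin-speed levels
                      batch    = 1:6)                 # 6 batches per combination
nrow(design)  # 54 batches in total
```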

Page 3:

Here are select rows from the dataset.

'cooktemp' is the cooking temperature in Celsius

'spinrpm' is the spin rate while cooling, in RPM

'bad' is the number of bad wafers in the batch

Page 4:

Note that even though we can describe temperature and speed as continuous variables, we are treating them as categories here.

Essentially we are calling them ‘low’, ‘medium’, and ‘high’ settings.

wafers$spinrpm = as.factor(wafers$spinrpm)

wafers$cooktemp = as.factor(wafers$cooktemp)

Page 5:

Here is the one-way ANOVA of 'bad' using cooking temperature as an explanatory variable.

mod = lm(bad ~ cooktemp, data=wafers)

anova(mod)

Page 6:

The p-value is small, so we have strong evidence that cooking temperature matters.

Without the p-value, we could compare the obtained F to a critical value for F.

(Recall: the F-test is one-tailed; we only care about larger variances.)

Page 7:

A hypothesis test tells us that some of the variance in bad wafer count is explained by cooktemp.

It doesn't tell us how much of the variance is explained.

For that we need the Sum of Squares total,

which is SSgroup + SSresid = 727 + 2934 = 3661

Page 8:

Proportion of variance explained, or R-squared

= SSgroup / SStotal

= 727 / 3661

= 0.1986, or 19.86% of variation explained.
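This arithmetic can be checked directly in R, using the sums of squares from the ANOVA table:

```r
# Reproducing the slide's arithmetic from the one-way ANOVA table values
SSgroup <- 727
SSresid <- 2934
SStotal <- SSgroup + SSresid   # 3661
SSgroup / SStotal              # 0.1986, i.e. 19.86% of variation explained
```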

Page 9:

We can also get this information from the summary of the lm() object that we used to get the ANOVA in the first place.

There's no such thing as a correlation in an ANOVA, but the ANOVA is sometimes described as having an R-squared because of this variance-explained connection.
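As a self-contained check (using simulated data, since wafers.csv isn't reproduced here), the Multiple R-squared reported by summary() matches SSgroup / SStotal computed from the anova() table:

```r
# Simulated one-way layout standing in for the wafer data
set.seed(302)
fake <- data.frame(grp = factor(rep(c("low", "med", "high"), each = 18)))
fake$y <- rnorm(54, mean = as.numeric(fake$grp) * 2, sd = 5)

mod <- lm(y ~ grp, data = fake)
ss  <- anova(mod)[["Sum Sq"]]               # c(SSgroup, SSresid)

summary(mod)$r.squared - ss[1] / sum(ss)    # 0: the two quantities agree
```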

Page 10:

A two-armed bird needs a two-way ANOVA

Page 11:

Here is the two-way ANOVA using both cooking temperature and spin rate to explain 'bad'.

mod = lm(bad ~ spinrpm + cooktemp, data=wafers)

anova(mod)

Page 12:

First, do we have evidence that the number of bad wafers changes with temperature?

What about by spin speed?

Yes to both.

The p-value associated with each factor is small.

Page 13:

So both factors are explaining a significant proportion of the variance. But how much?

We need the sum of squares total. This is 3661, the total of the sums of squares from all sources: temperature, spin speed, and residuals.

Page 14:

SStotal = SSspin + SStemp + SSresid

= 1840 + 727 + 1094

= 3661 (The same as in the one-way ANOVA)

Of this total, spin speed explains

SSspin / SStotal = 1840 / 3661 = 50.26%

of the variation, and temperature explains

SStemp / SStotal = 727 / 3661 = 19.86% of it.
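The same breakdown, reproduced in R from the table values:

```r
# Two-way breakdown, using the sums of squares from the slide's ANOVA table
SSspin  <- 1840
SStemp  <- 727
SSresid <- 1094
SStotal <- SSspin + SStemp + SSresid          # 3661, same total as before

round(100 * SSspin / SStotal, 2)              # 50.26 (% explained by spin)
round(100 * SStemp / SStotal, 2)              # 19.86 (% explained by temp)
round(100 * (SSspin + SStemp) / SStotal, 2)   # 70.12 (% explained together)
```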

Page 15:

Both grouping variables together explain

2567 / 3661 = 70.12% of the variation

We can confirm this by looking at the linear model summary.

The multiple R-squared should match the variance explained by the model (i.e., everything but the residuals).

Page 16:

Are they evolving, or are we regressing?

Page 17:

Recall that in Assignment 1, we looked at some National Hockey League data. We made a model of wins as a function of goals against (GA).

This is a simple regression model. The regression equation is

Wins = 78.83 – 0.163*GA + error

Page 18:

Wins = 78.83 – 0.163*GA + error

…means that a team with 0 goals against it is expected to win 78.83 of their 82 games, and that every goal against the team costs it 0.163 wins.

In this model, goals against explained 42.21% of the variation in the number of wins.

Page 19:

We can expand this from a simple regression into a multiple regression model by incorporating a second explanatory variable, Goals For (GF)

The regression equation is

Wins = 37.95 – 0.163*GA + 0.177*GF + error

Page 20:

The regression equation is

Wins = 37.95 – 0.163*GA + 0.177*GF + error

…meaning that a team with both 0 goals against and 0 goals for is expected to win 37.95 games (a bit fewer than half of its 82 games).

Every goal against will reduce this win count by 0.163 (holding “goals for” constant)

Every goal for will increase the win count by 0.177

(holding “goals against” constant)
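As a sketch, the regression equation can be wrapped in a small R function and evaluated for a hypothetical team (the GA and GF values below are made up for illustration, not taken from the data):

```r
# Evaluating the slide's multiple regression equation.
# The coefficients come from the slide; the inputs are hypothetical.
predict_wins <- function(GA, GF) 37.95 - 0.163 * GA + 0.177 * GF

predict_wins(GA = 220, GF = 240)   # 44.57 expected wins for this made-up team
```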

Page 21:

When doing a multiple regression, the slope coefficient associated with each variable is implicitly interpreted as “holding the other variables constant”.

That means we take each slope effect separately, even if they often appear together.

Example:

If the team makes a change (e.g. a trade or a coaching change) such that it will score 5 more goals in a season, but also allow 3 more goals, then:

Page 22:

Adding 5 'goals for' and 3 'goals against'.

The effect of the additional goals against is to earn

0.163 * 3 = 0.489 fewer wins per season.

The effect of the increase in goals for is to earn

0.177 * 5 = 0.885 more wins per season.

The total effect is the sum of each separate effect, so with the change, we expect an increase of

(-0.489) + 0.885 = 0.396 wins this season.
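The same arithmetic in R:

```r
# Net effect on expected wins of allowing 3 more goals while scoring 5 more,
# using the slopes from the multiple regression equation on the slides
ga_effect <- -0.163 * 3   # -0.489 wins from the extra goals against
gf_effect <-  0.177 * 5   #  0.885 wins from the extra goals for

ga_effect + gf_effect     #  0.396 more wins expected
```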

Page 23:

In this multiple regression model, goals for and against together explain 82.93% of the variation in wins.

In this model, it’s not surprising that including both 'goals for' and 'goals against' is better than including only one.

However, r-squared never decreases when additional explanatory variables are included, and with enough variables it can be pushed all the way to 100%.

Page 24:

Even if the new variables are completely random noise, the r-squared will increase by a little bit.

We use the ‘multiple r-squared’ in the model summary because it’s easy to interpret, but the adjusted r-squared is also useful: it is always a little less than the multiple r-squared, accounting for the amount that r-squared would increase from random noise alone.
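For reference, the adjustment follows the standard textbook formula (not shown on the slide). The sample size n = 30 teams below is an assumption for illustration, not a value from the notes:

```r
# Standard adjusted R-squared formula:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# n = 30 teams is assumed; p = 2 predictors (GA and GF).
adj_rsq <- function(rsq, n, p) 1 - (1 - rsq) * (n - 1) / (n - p - 1)

adj_rsq(0.8293, n = 30, p = 2)   # a little below the multiple R-squared of 0.8293
```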

Page 25:

Question: Don’t goals for and goals against determine wins entirely? If you score more goals than your opponent, you win. End of story. Right?

Answer: For a single game, that’s true. But we don’t have data at this resolution.

We have the total goals for and against for the entire season, but not for individual games.

Page 26:

When we aggregate data (e.g. add together the goals from different games in a season), we lose some information.

Winning a game by 1 goal and winning it by 50 goals both count as a single win.

That’s where the remaining 17% of unexplained variance lies: in the differences between individual games.

Page 27:

Question: Could there be a team so terrible that the model predicts it to have fewer than 0 wins?

Answer: Yes. However, such a team would be an extreme outlier in the data.

We shouldn’t extrapolate and apply the model to cases far outside the data we have observed.

Page 28:

Break time
