Multiple Regression. Multiple regression Typically, we want to use more than a single predictor (independent variable) to make predictions Regression

Multiple Regression

Multiple regressionTypically, we want to use more than a single predictor (independent variable) to make predictions

Regression with more than one predictor is called “multiple regression”

€

y i =α + β1x1i + β 2x2i +K + β px pi + ε i

Motivating example: Sex discrimination in wagesIn 1970’s, Harris Trust and Savings Bank was sued for discrimination on the basis of sex.

Analysis of salaries of employees of one type (skilled, entry-level clerical) presented as evidence by the defense.

Did female employees tend to receive lower starting salaries than similarly qualified and experienced male employees?

Variables collected 93 employees on data file (61 female, 32 male).

bsal: Annual salary at time of hire.sal77 : Annual salary in 1977.educ: years of education.exper: months previous work prior to hire at bank.

fsex: 1 if female, 0 if malesenior: months worked at bank since hiredage: months

So we have six x’s and and one y (bsal). However, in what follows we won’t use sal77.

Comparison for male and femalesThis shows men started at higher salaries than women (t=6.3, p<.0001).

But, it doesn’t control for other characteristics.

bsal

4000

5000

6000

7000

8000

Female Male

fsex

Oneway Analysis of bsal By fsex

Relationships of bsal with other variablesSenior and education predict bsal well. We want to control for them when judging gender effect.

4000

5000

6000

7000

8000

bsal

60 65 70 75 80 85 90 95 100

senior

Linear Fit

Bivariate Fit of bsal By senior

4000

5000

6000

7000

8000

bsal

300 400 500 600 700 800

age

Linear Fit

Bivariate Fit of bsal By age

4000

5000

6000

7000

8000

bsal

7 8 9 10 11 12 13 14 15 16 17

educ

Linear Fit

Bivariate Fit of bsal By educ

4000

5000

6000

7000

8000

bsal

-50 0 50 100 150 200 250 300 350 400

exper

Linear Fit

Bivariate Fit of bsal By exper

Fit Y by X Group

Multiple regression modelFor any combination of values of the predictor variables, the average value of the response (bsal) lies on a straight line:

Just like in simple regression, assume that ε follows a normal curve within any combination of predictors.

€

bsali =α + β1fsexi + β 2seniori + β 3age i + β 4educ i + β 5experi + ε i

Output from regression (fsex = 1 for females, = 0 for males)

Term Estimate Std Error t Ratio Prob>|t| Int. 6277.9 652 9.62 <.0001

Fsex -767.9 128.9 -5.95 <.0001

Senior -22.6 5.3 -4.26 <.0001

Age 0.63 .72 .88 .3837

Educ 92.3 24.8 3.71 .0004

Exper 0.50 1.05 .47 .6364

4000

5000

6000

7000

8000

bsal A

ctu

al

4000 5000 6000 7000 8000

bsal Predicted P<.0001 RSq=0.52

RMSE=508.09

Actual by Predicted Plot

RSquare

RSquare Adj

Root Mean Square Error

Mean of Response

Observations (or Sum Wgts)

0.515156

0.487291

508.0906

5420.323

93

Summary of Fit

Model

Error

C. Total

Source

5

87

92

DF

23863715

22459575

46323290

Sum of Squares

4772743

258156

Mean Square

18.4878

F Ratio

<.0001

Prob > F

Analysis of Variance

Intercept

fsex

senior

age

educ

exper

Term

6277.8934

-767.9127

-22.5823

0.6309603

92.306023

0.5006397

Estimate

652.2713

128.97

5.295732

0.720654

24.86354

1.055262

Std Error

9.62

-5.95

-4.26

0.88

3.71

0.47

t Ratio

<.0001

<.0001

<.0001

0.3837

0.0004

0.6364

Prob>|t|

Parameter Estimates

fsex

senior

age

educ

exper

Source

1

1

1

1

1

Nparm

1

1

1

1

1

DF

9152264.3

4694256.3

197894.0

3558085.8

58104.8

Sum of Squares

35.4525

18.1838

0.7666

13.7827

0.2251

F Ratio

<.0001

<.0001

0.3837

0.0004

0.6364

Prob > F

Effect Tests

-1000

-500

0

500

1000

1500

bsal R

esid

ual

4000 5000 6000 7000 8000

bsal Predicted

Residual by Predicted Plot

Whole Model age educ exper

Response bsal

PredictionsExample: Prediction of beginning wages for a woman with 10 months seniority, that is 25 years old, with 12 years of education, and two years of experience:

Pred. bsal = 6277.9 - 767.9*1 - 22.6*10 + .63*300 + 92.3*12 + .50*24

= 6592.6

€

bsali =α + β1fsexi + β 2seniori + β 3age i + β 4educ i + β 5experi + ε i

Interpretation of coefficients in multiple regressionEach estimated coefficient is amount Y is expected to increase when the value of its corresponding predictor is increased by one, holding constant the values of the other predictors.

Example: estimated coefficient of education equals 92.3. For each additional year of education of employee, we expect salary to increase by about 92 dollars, holding all other variables constant.

Estimated coefficient of fsex equals -767.

For employees who started at the same time, had the same education and experience, and were the same age, women earned $767 less on average than men.

Which variable is the strongest predictor of the outcome?The coefficient that has the strongest linear association with the outcome variable is the one with the largest absolute value of T, which equals the coefficient over its SE.

It is not size of coefficient. This is sensitive to scales of predictors. The T statistic is not, since it is a standardized measure.

Example: In wages regression, seniority is a better predictor than education because it has a larger T.

Hypothesis tests for coefficientsThe reported t-stats (coef. / SE) and p-values are used to test whether a particular coefficient equals 0, given that all other coefficients are in the model.

Examples:

1) Test whether coefficient of education equals zero has p-value = .0004. Hence, reject the null hypothesis; it appears that education is a useful predictor of bsal when all the other predictors are in the model.

2) Test whether coefficient of experience equals zero has p-value = .6364. Hence, we cannot reject the null hypothesis; it appears that experience is not a particularly useful predictor of bsal when all other predictors are in the model.

Hypothesis tests for coefficientsThe test statistics have the usual form (observed – expected)/SE.

For p-value, use area under a t-curve with (n-k) degrees of freedom, where k is the number of terms in the model.

In this problem, the degrees of freedom equal (93-6=87).

CIs for regression coefficientsA 95% CI for the coefficients is obtained in the usual way:

coef. ± (multiplier) SE

The multiplier is obtained from the t-curve with (n-k) degrees of freedom. (If degrees of freedom is greater than 26 use normal table)

Example: A 95% CI for the population regression coefficient of age equals:

(0.63 – 1.96*0.72, 0.63 + 1.96*0.72)

Warning about tests and CIsHypothesis tests and CIs are meaningful only when the data fits the model well.

Remember, when the sample size is large enough, you will probably reject any null hypothesis of β=0.

When the sample size is small, you may not have enough evidence to reject a null hypothesis of β=0.

When you fail to reject a null hypothesis, don’t be too hasty to say that a predictor has no linear association with the outcome. It is likely that there is some association, it just isn’t a very strong one.

Checking assumptionsPlot the residuals versus the predicted values from the regression line.

Also plot the residuals versus each of the predictors.

If non-random patterns in these plots, the assumptions might be violated.

Plot of residuals versus predicted valuesThis plot has a fan shape.

It suggests non-constant variance (heteroscedastic).

We need to transform variables.

-3000

-2000

-1000

0

1000

2000

3000

4000

5000

sal77

Res

idual

7000 9000 11000 13000 15000 17000

sal77 Predicted

Residual by Predicted Plot

Whole Model

Response sal77

Plots of residuals vs. predictors

-1000

-500

0

500

1000

1500

Res

idua

l bsa

l 2

-50 0 50 100 150 200 250 300 350 400

exper

Bivariate Fit of Residual bsal 2 By exper

-1000

-500

0

500

1000

1500

Res

idua

l bsa

l 2

-0.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 1.1

fsex

Bivariate Fit of Residual bsal 2 By fsex

Fit Y by X Group

-1000

-500

0

500

1000

1500

Res

idua

l bsa

l 2

60 65 70 75 80 85 90 95 100

senior

Bivariate Fit of Residual bsal 2 By senior

-1000

-500

0

500

1000

1500

Res

idua

l bsa

l 2

300 400 500 600 700 800

age

Bivariate Fit of Residual bsal 2 By age

-1000

-500

0

500

1000

1500

Res

idua

l bsa

l 2

7 8 9 10 11 12 13 14 15 16 17

educ

Bivariate Fit of Residual bsal 2 By educ

Fit Y by X Group

Summary of residual plotsThere appears to be a non-random pattern in the plot of residuals versus experience, and also versus age.

This model can be improved.

Modeling categorical predictorsWhen predictors are categorical and assigned numbers, regressions using those numbers make no sense.

Instead, we make “dummy variables” to stand in for the categorical variables.

CollinearityWhen predictors are highly correlated, standard errors are inflated

Conceptual example:Suppose two variables Z and X are exactly the same.

Suppose the population regression line of Y on X isY = 10 + 5X

Fit a regression using sample data of Y on both X and Z. We could plug in any value for the coefficients of X and Z, so long as they add up to 5. Equivalently this means that the standard errors for the coefficients are huge

General warnings for multiple regressionBe even more wary of extrapolation. Because there are several predictors, you can extrapolate in many ways

Multiple regression shows association. It does not prove causality. Only a carefully designed observational study or randomized experiment can show causality

Documents

Multiple Regression. Multiple regression Typically, we want to use more than a single predictor (independent variable) to make predictions Regression