
Chap 14-1

Chapter 14

Introduction to Multiple Regression

Basic Business Statistics, 10th Edition


Chap 14-2

Learning Objectives

In this chapter, you learn:
- How to develop a multiple regression model
- How to interpret the regression coefficients
- How to determine which independent variables to include in the regression model
- How to determine which independent variables are more important in predicting a dependent variable
- How to use categorical variables in a regression model
- How to predict a categorical dependent variable using logistic regression


Chap 14-3

The Multiple Regression Model

Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)

Multiple Regression Model with k Independent Variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error


Chap 14-4

Multiple Regression Equation

The coefficients of the multiple regression model are estimated using sample data

Multiple regression equation with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where Ŷi is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients

In this chapter we will always use Minitab to obtain the regression slope coefficients and other regression summary measures.


Chap 14-5

Multiple Regression Equation (continued)

Two variable model: Ŷ = b0 + b1X1 + b2X2

[Figure: 3-D plot of the regression plane over the (X1, X2) space, with slope b1 for variable X1 and slope b2 for variable X2]


Chap 14-6

Example: 2 Independent Variables

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand

Dependent variable: Pie sales (units per week)
Independent variables: Price (in $), Advertising (in $100s)

Data are collected for 15 weeks


Chap 14-7

Pie Sales Example

Week  Pie Sales  Price ($)  Advertising ($100s)
  1      350       5.50            3.3
  2      460       7.50            3.3
  3      350       8.00            3.0
  4      430       8.00            4.5
  5      350       6.80            3.0
  6      380       7.50            4.0
  7      430       4.50            3.0
  8      470       6.40            3.7
  9      450       7.00            3.5
 10      490       5.00            4.0
 11      340       7.20            3.5
 12      300       7.90            3.2
 13      440       5.90            4.0
 14      450       5.00            3.5
 15      300       7.00            2.7

Multiple regression equation:

Sales = b0 + b1 (Price) + b2 (Advertising)


Chap 14-8

Using Minitab

Stat | Regression | Regression …
Enter the appropriate variables for Response and Predictors


Chap 14-9

Minitab Multiple Regression Output

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor      Coef  SE Coef      T      P
Constant      306.5    114.3   2.68  0.020
Price        -24.98    10.83  -2.31  0.040
Advertising   74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360
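For readers who want to reproduce this output outside Minitab, here is a minimal sketch in Python with pandas and statsmodels (a library choice of this note, not part of the deck's Minitab workflow), using the 15 weeks of data from the table above:

    import pandas as pd
    import statsmodels.api as sm

    # Pie sales data for the 15 weeks (from the table above)
    df = pd.DataFrame({
        "Sales": [350, 460, 350, 430, 350, 380, 430, 470, 450,
                  490, 340, 300, 440, 450, 300],
        "Price": [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00],
        "Advertising": [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                        4.0, 3.5, 3.2, 4.0, 3.5, 2.7],
    })

    X = sm.add_constant(df[["Price", "Advertising"]])  # add the intercept column
    model = sm.OLS(df["Sales"], X).fit()
    print(model.summary())  # b0 ≈ 306.5, b1 ≈ -24.98, b2 ≈ 74.13, matching the Minitab output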


Chap 14-10

The Multiple Regression Equation

Sales = 306.5 - 24.98 (Price) + 74.13 (Advertising)

b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising

b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, net of the effects of changes due to price

where Sales is in number of pies per week, Price is in $, and Advertising is in $100s


Chap 14-11

Using The Equation to Make Predictions

Predict sales for a week in which the selling price is $5.50 and advertising is $350:

Sales = 306.5 - 24.98 (Price) + 74.13 (Advertising)
      = 306.5 - 24.98 (5.50) + 74.13 (3.5)
      = 428.6

Predicted sales is 428.6 pies

Note that Advertising is in $100’s, so $350 means that X2 = 3.5
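A quick arithmetic check in plain Python, using the rounded coefficients from the slide:

    # Point prediction at Price = $5.50 and Advertising = $350 (so X2 = 3.5)
    b0, b1, b2 = 306.5, -24.98, 74.13
    predicted_sales = b0 + b1 * 5.50 + b2 * 3.5
    print(round(predicted_sales, 1))  # 428.6 pies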


Chap 14-12

Predictions in Minitab

Select Stat | Regression | Regression …, then click on the “Options” box

Check the Confidence limits and Prediction limits boxes

Enter desired value for each independent variable


Chap 14-13

Predictions in Minitab (continued)

Minitab output:

Predicted Values for New Observations

New Obs    Fit  SE Fit      95% CI            95% PI
      1  428.6    17.2  (391.1, 466.1)  (318.6, 538.6)

Values of Predictors for New Observations

New Obs  Price  Advertising
      1   5.50         3.50   <- input values

Fit is the predicted Y value; the 95% CI is the confidence interval for the mean Y value, given these X’s; the 95% PI is the prediction interval for an individual Y value, given these X’s
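The same confidence and prediction intervals can be produced programmatically; a sketch with statsmodels' get_prediction (our library choice, not the deck's), refitting the pie-sales model:

    import pandas as pd
    import statsmodels.api as sm

    sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
    price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
    adv = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

    X = sm.add_constant(pd.DataFrame({"Price": price, "Advertising": adv}))
    model = sm.OLS(sales, X).fit()

    # New observation: Price = 5.50, Advertising = 3.5 (i.e., $350)
    new = pd.DataFrame({"const": [1.0], "Price": [5.50], "Advertising": [3.5]})
    frame = model.get_prediction(new).summary_frame(alpha=0.05)
    # mean_ci_lower/upper = 95% CI for the mean; obs_ci_lower/upper = 95% PI for an individual Y
    print(frame.round(1))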


Chap 14-14

Coefficient of Multiple Determination

Reports the proportion of total variation in Y explained by all X variables taken together

r2 = SSR / SST = regression sum of squares / total sum of squares
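A one-line check of this ratio for the pie-sales model, using SSR = 29460 and SST = 56493 from the ANOVA table (shown on the next slide), in plain Python:

    # r-squared = SSR / SST for the pie sales model
    SSR, SST = 29460, 56493
    print(round(SSR / SST, 3))  # 0.521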


Chap 14-15

Multiple Coefficient of Determination (continued)

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor      Coef  SE Coef      T      P
Constant      306.5    114.3   2.68  0.020
Price        -24.98    10.83  -2.31  0.040
Advertising   74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360

r2 = SSR / SST = 29460 / 56493 = .521

52.1% of the variation in pie sales is explained by the variation in price and advertising


Chap 14-16

Adjusted r2

r2 never decreases when a new X variable is added to the model
- This can be a disadvantage when comparing models

What is the net effect of adding a new variable?
- We lose a degree of freedom when a new X variable is added
- Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?


Chap 14-17

Adjusted r2 (continued)

Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r2adj = 1 - [(1 - r2)(n - 1) / (n - k - 1)]

(where n = sample size, k = number of independent variables)

- Penalizes excessive use of unimportant independent variables
- Smaller than r2
- Useful in comparing models
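Applying this formula to the pie-sales model (n = 15, k = 2, and r2 = .521 from the previous slides), a quick check in plain Python:

    # adjusted r-squared for the pie sales model
    n, k = 15, 2
    r_sq = 29460 / 56493                               # SSR / SST = .521
    r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
    print(round(r_sq_adj, 3))                          # 0.442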


Chap 14-18

Multiple Coefficient of Determination (continued)

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor      Coef  SE Coef      T      P
Constant      306.5    114.3   2.68  0.020
Price        -24.98    10.83  -2.31  0.040
Advertising   74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360

r2adj = .442

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables


Chap 14-19

Is the Model Significant?

F Test for Overall Significance of the Model
- Shows if there is a linear relationship between all of the X variables considered together and Y
- Use the F-test statistic
- Hypotheses:

H0: β1 = β2 = … = βk = 0 (no linear relationship)

H1: at least one βi ≠ 0 (at least one independent variable affects Y)


Chap 14-20

F Test for Overall Significance

Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where F has df1 = k (numerator) and df2 = n - k - 1 (denominator) degrees of freedom
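A sketch of this computation, including the p-value, in Python with scipy (our choice; the deck reads these values off the Minitab ANOVA table):

    from scipy.stats import f

    SSR, SSE = 29460, 27033   # from the pie-sales ANOVA table
    n, k = 15, 2
    MSR = SSR / k             # 14730
    MSE = SSE / (n - k - 1)   # ≈ 2253
    F = MSR / MSE
    p_value = f.sf(F, k, n - k - 1)        # upper-tail area of the F distribution
    print(round(F, 2), round(p_value, 3))  # 6.54 0.012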


Chap 14-21

F Test for Overall Significance (continued)

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor      Coef  SE Coef      T      P
Constant      306.5    114.3   2.68  0.020
Price        -24.98    10.83  -2.31  0.040
Advertising   74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360

F = MSR / MSE = 14730 / 2253 = 6.54

With 2 and 12 degrees of freedom, the p-value for the F test is 0.012


Chap 14-22

F Test for Overall Significance (continued)

H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12

Critical Value: F.05 = 3.885

Test Statistic: F = MSR / MSE = 6.54

Decision: Since the F test statistic is in the rejection region (F = 6.54 > F.05 = 3.885, p-value < .05), reject H0

Conclusion: There is evidence that at least one independent variable affects Y

[Figure: F distribution sketch, do not reject H0 for F below the critical value F.05 = 3.885; reject H0 in the upper α = .05 tail]


Chap 14-23

Are Individual Variables Significant?

Use t tests of individual variable slopes
- Shows if there is a linear relationship between the variable Xj and Y

Hypotheses:

H0: βj = 0 (no linear relationship)

H1: βj ≠ 0 (linear relationship does exist between Xj and Y)


Chap 14-24

Are Individual Variables Significant? (continued)

H0: βj = 0 (no linear relationship)

H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Test Statistic:

t = (bj - 0) / Sbj    (df = n - k - 1)

where Sbj is the standard error of the slope bj
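As a check, the Price slope of the pie-sales model (b1 = -24.98, Sb1 = 10.83, df = 12) gives the t and p values reported on the next slide; a sketch with scipy (our choice):

    from scipy.stats import t

    t_stat = (-24.98 - 0) / 10.83                 # (b1 - 0) / Sb1
    p_value = 2 * t.sf(abs(t_stat), df=12)        # two-tailed p-value
    print(round(t_stat, 2), round(p_value, 3))    # -2.31 0.040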



Chap 14-25

Are Individual Variables Significant? (continued)

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor      Coef  SE Coef      T      P
Constant      306.5    114.3   2.68  0.020
Price        -24.98    10.83  -2.31  0.040
Advertising   74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360

t-value for Price is t = -2.31, with p-value 0.04

t-value for Advertising is t = 2.85, with p-value 0.014


Chap 14-26

Inferences about the Slope: t Test Example

H0: βj = 0
H1: βj ≠ 0

d.f. = 15 - 2 - 1 = 12, α = .05, so tα/2 = 2.1788

From Minitab output:

              Coef  SE Coef      T      P
Price       -24.98    10.83  -2.31  0.040
Advertising  74.13    25.97   2.85  0.014

Decision: Reject H0 for each variable; the test statistic for each variable falls in the rejection region (p-values < .05)

Conclusion: There is evidence that both Price and Advertising affect pie sales at α = .05

[Figure: t distribution sketch, reject H0 in each α/2 = .025 tail beyond ±tα/2 = ±2.1788; do not reject H0 in between]


Chap 14-27

Confidence Interval Estimate for the Slope

Confidence interval for the population slope βj:

bj ± tn-k-1 Sbj    where t has (n - k - 1) d.f.

Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales.

d.f. = 15 - 2 - 1 = 12, α = .05, so tα/2 = 2.1788

              Coef  SE Coef
Price       -24.98    10.83
Advertising  74.13    25.97

-24.98 ± (2.1788)(10.83)

So the interval is (-48.576, -1.374)

(This interval does not contain zero, so price has a significant effect on sales)
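The same interval in Python with scipy (our choice); note the upper limit differs slightly from the slide because the rounded coefficient and standard error are used here:

    from scipy.stats import t

    b1, se_b1 = -24.98, 10.83
    t_crit = t.ppf(0.975, df=12)   # 2.1788
    print(round(b1 - t_crit * se_b1, 2), round(b1 + t_crit * se_b1, 2))
    # -48.58 -1.38 (the slide's -48.576, -1.374 come from unrounded estimates)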


Chap 14-28

Confidence Interval Estimate for the Slope (continued)

Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales:

-24.98 ± (2.1788)(10.83)

So the interval is (-48.576, -1.374)

Interpretation: Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price


Chap 14-29

Assumptions of Regression

Use the acronym LINE:

Linearity
- The underlying relationship between X and Y is linear

Independence of Errors
- Error values are statistically independent

Normality of Error
- Error values (ε) are normally distributed for any given value of X

Equal Variance (Homoscedasticity)
- The probability distribution of the errors has constant variance


Chap 14-30

Residual Analysis

The residual for observation i, ei, is the difference between its observed and predicted value:

ei = Yi - Ŷi

Check the assumptions of regression by examining the residuals:
- Examine for linearity assumption
- Evaluate independence assumption
- Evaluate normal distribution assumption
- Examine for constant variance for all levels of X (homoscedasticity)

Graphical Analysis of Residuals
- Can plot residuals vs. X
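A sketch of such residual plots in Python with statsmodels and matplotlib (plotting choices of this note; the deck uses Minitab's built-in plots), refitting the pie-sales model:

    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        "Sales": [350, 460, 350, 430, 350, 380, 430, 470, 450,
                  490, 340, 300, 440, 450, 300],
        "Price": [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00],
        "Advertising": [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                        4.0, 3.5, 3.2, 4.0, 3.5, 2.7],
    })
    model = sm.OLS(df["Sales"], sm.add_constant(df[["Price", "Advertising"]])).fit()

    # Plot residuals against each independent variable; look for random scatter around 0
    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    for ax, name in zip(axes, ["Price", "Advertising"]):
        ax.scatter(df[name], model.resid)
        ax.axhline(0, color="gray", linewidth=1)
        ax.set_xlabel(name)
        ax.set_ylabel("residual")
    plt.tight_layout()
    plt.show()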


Chap 14-31

Residual Analysis for Linearity

[Figure: panels comparing Y vs. x scatterplots with their residuals vs. x plots. Not Linear: residuals show a curved, systematic pattern. Linear: residuals scatter randomly around zero]


Chap 14-32

Residual Analysis for Independence

[Figure: residuals vs. X plots. Not Independent: residuals follow a systematic pattern over X. Independent: residuals scatter randomly with no pattern]


Chap 14-33

Residual Analysis for Normality

A normal probability plot of the residuals can be used to check for normality:

[Figure: normal probability plot, Percent (0 to 100) vs. Residual (-3 to 3)]


Chap 14-34

Residual Analysis for Equal Variance

[Figure: residuals vs. x plots. Non-constant variance: the spread of the residuals changes with x. Constant variance: the spread of the residuals is similar across all levels of x]


Chap 14-35

Examples of Minitab Plots


Chap 14-36

Measuring Autocorrelation: The Durbin-Watson Statistic

Used when data are collected over time to detect if autocorrelation is present

Autocorrelation exists if residuals in one time period are related to residuals in another period


Chap 14-37

Autocorrelation

Autocorrelation is correlation of the errors (residuals) over time

Violates the regression assumption that residuals are random and independent

[Figure: Time (t) Residual Plot, residuals plotted against time, cycling between roughly -15 and +15]

Here, residuals show a cyclic pattern, not random. Cyclical patterns are a sign of positive autocorrelation


Chap 14-38

The Durbin-Watson Statistic

The Durbin-Watson statistic is used to test for autocorrelation:

H0: residuals are not correlated
H1: positive autocorrelation is present

D = Σ(ei - ei-1)² / Σei²    (numerator summed over i = 2, …, n; denominator over i = 1, …, n)

- The possible range is 0 ≤ D ≤ 4
- D should be close to 2 if H0 is true
- D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation
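A minimal sketch of this formula in Python; the residuals below are hypothetical numbers chosen to cycle, purely for illustration. (statsmodels also provides this computation as statsmodels.stats.stattools.durbin_watson.)

    import numpy as np

    def durbin_watson(e):
        # D = sum of squared successive differences / sum of squared residuals
        e = np.asarray(e, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # Hypothetical residuals with a cyclic pattern (positive autocorrelation)
    e = [5, 8, 6, 1, -4, -7, -5, -1, 4, 7]
    print(round(durbin_watson(e), 3))  # 0.447, well below 2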


Chap 14-39

Testing for Positive Autocorrelation

H0: positive autocorrelation does not exist
H1: positive autocorrelation is present

Calculate the Durbin-Watson test statistic D (the Durbin-Watson statistic can be found using Excel or Minitab)

Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k)

Decision rule: reject H0 if D < dL

[Decision scale from 0 to 2: reject H0 for D < dL; inconclusive for dL ≤ D ≤ dU; do not reject H0 for dU < D ≤ 2]


Chap 14-40

Testing for Positive Autocorrelation (continued)

Suppose we have the following time series data:

[Figure: scatterplot of Sales vs. Time with fitted trend line y = 30.65 + 4.7038x, R2 = 0.8976]

Is there autocorrelation?


Chap 14-41

Testing for Positive Autocorrelation (continued)

Example with n = 25:

[Figure: scatterplot of Sales vs. Time with fitted trend line y = 30.65 + 4.7038x, R2 = 0.8976]

Durbin-Watson Calculations
Sum of Squared Difference of Residuals  3296.18
Sum of Squared Residuals                3279.98
Durbin-Watson Statistic                 1.00494

D = Σ(ei - ei-1)² / Σei² = 3296.18 / 3279.98 = 1.00494


Chap 14-42

Minitab Output


Chap 14-43

Testing for Positive Autocorrelation (continued)

- Here, n = 25 and there is k = 1 independent variable
- Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
- D = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
- Therefore the linear model is not the appropriate model to forecast sales

Decision: reject H0 since D = 1.00494 < dL

[Decision scale from 0 to 2: reject H0 for D < dL = 1.29; inconclusive for 1.29 ≤ D ≤ 1.45; do not reject H0 for 1.45 < D ≤ 2]


Chap 14-44

Influence Analysis

Several methods can be used to measure the influence of individual observations:

The hat matrix elements hi

The Studentized deleted residuals ti

Cook’s distance statistic Di


Chap 14-45

The Hat Matrix Elements hi

The hat matrix diagonal element for observation i, denoted hi, reflects the possible influence of Xi on the regression equation

If potentially influential observations are present, you may need to reevaluate the need to keep them in the model

Rule: if hi > 2(k + 1)/n, then Xi is an influential observation and is a candidate for removal from the model


Chap 14-46

The Studentized Deleted Residuals ti

The Studentized deleted residual measures the difference of each Yi from the value predicted by a model that includes all observations except observation i

Expressed as a t statistic


Chap 14-47

The Studentized Deleted Residuals ti (continued)

The Studentized deleted residual:

ti = ei √[(n - k - 2) / (SSE(1 - hi) - ei²)]

where ei = the residual for observation i
      k = number of independent variables
      SSE = error sum of squares
      hi = hat matrix diagonal element for observation i

If ti > tn-k-2 or ti < -tn-k-2 (using a two-tail test at α = 0.10), then observation i is highly influential and a candidate for removal


Chap 14-48

Cook’s distance statistic Di

Cook’s distance statistic Di is based on both hi and the Studentized residual

Used to decide whether an observation flagged by either the hi or ti criterion is unduly affecting the model


Chap 14-49

Cook’s distance statistic Di (continued)

Cook’s Di statistic:

Di = [ei² / (k MSE)] × [hi / (1 - hi)²]

where ei = the residual for observation i
      k = number of independent variables
      MSE = mean square error
      hi = hat matrix diagonal element for observation i

If Di > Fk+1, n-k-1 at α = 0.05, then the observation is highly influential on the regression equation and is a candidate for removal
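All three influence measures can be obtained from a fitted model in Python with statsmodels (our library choice); a sketch for the pie-sales fit. Note that statsmodels' Cook's distance divides by the number of model parameters (k + 1) rather than k, so its values differ slightly from the slide's formula:

    import pandas as pd
    import statsmodels.api as sm

    sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
    price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
    adv = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

    X = sm.add_constant(pd.DataFrame({"Price": price, "Advertising": adv}))
    model = sm.OLS(sales, X).fit()
    infl = model.get_influence()

    n, k = 15, 2
    hat = infl.hat_matrix_diag                 # hat matrix diagonal elements hi
    t_del = infl.resid_studentized_external    # Studentized deleted residuals ti
    cooks_d = infl.cooks_distance[0]           # Cook's Di (scaled by k + 1 parameters)

    # Hat-matrix screening rule from the earlier slide: hi > 2(k + 1)/n
    print((hat > 2 * (k + 1) / n).nonzero())   # indices of candidate influential observations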


Chap 14-50

Collinearity

Collinearity: High correlation exists among two or more independent variables

This means the correlated variables contribute redundant information to the multiple regression model


Chap 14-51

Collinearity (continued)

Including two highly correlated independent variables can adversely affect the regression results:
- No new information provided
- Can lead to unstable coefficients (large standard errors and low t-values)
- Coefficient signs may not match prior expectations


Chap 14-52

Some Indications of Strong Collinearity

- Incorrect signs on the coefficients
- Large change in the value of a previous coefficient when a new variable is added to the model
- A previously significant variable becomes non-significant when a new independent variable is added
- The estimate of the standard deviation of the model increases when a variable is added to the model


Chap 14-53

Detecting Collinearity (Variance Inflationary Factor)

VIFj is used to measure collinearity:

VIFj = 1 / (1 - R2j)

where R2j is the coefficient of determination of variable Xj with all other X variables

If VIFj > 5, Xj is highly correlated with the other independent variables
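A sketch of the VIF computation for the pie-sales predictors in Python with statsmodels (our library choice, not the deck's Minitab workflow):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
    adv = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

    X = sm.add_constant(pd.DataFrame({"Price": price, "Advertising": adv}))
    for j, name in enumerate(X.columns):
        if name != "const":  # the intercept's VIF is not of interest here
            print(name, round(variance_inflation_factor(X.values, j), 2))  # both ≈ 1.0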


Chap 14-54

Example: Pie Sales

Recall the multiple regression equation developed for the pie sales example:

Sales = b0 + b1 (Price) + b2 (Advertising)

Week  Pie Sales  Price ($)  Advertising ($100s)
  1      350       5.50            3.3
  2      460       7.50            3.3
  3      350       8.00            3.0
  4      430       8.00            4.5
  5      350       6.80            3.0
  6      380       7.50            4.0
  7      430       4.50            3.0
  8      470       6.40            3.7
  9      450       7.00            3.5
 10      490       5.00            4.0
 11      340       7.20            3.5
 12      300       7.90            3.2
 13      440       5.90            4.0
 14      450       5.00            3.5
 15      300       7.00            2.7


Chap 14-55

Detecting Collinearity in Minitab

Stat | Regression | Regression …
Click on the “Options” box
Select the “Variance inflationary factors” box

Output for the pie sales example:

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor      Coef  SE Coef      T      P  VIF
Constant      306.5    114.3   2.68  0.020
Price        -24.98    10.83  -2.31  0.040  1.0
Advertising   74.13    25.97   2.85  0.014  1.0

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

VIFs are < 5: there is no evidence of collinearity between Price and Advertising


Chap 14-56

Chapter Summary

- Developed the multiple regression model
- Tested the significance of the multiple regression model
- Discussed adjusted r2
- Discussed using residual plots to check model assumptions
- Tested individual regression coefficients