1
Multiple Regression
Chapter 17
2
Introduction
• In this chapter we extend the simple linear regression model, and allow for any number of independent variables.
• We expect to build a model that fits the data better than the simple linear regression model.
3
Introduction
[Scatter plot: Weight vs. Calories consumed]
• We all believe that weight is affected by the amount of calories consumed. Yet the actual effect differs from one individual to another.
• Therefore, a simple linear relationship leaves much unexplained error.
4
Introduction
[Scatter plot: Weight vs. Calories consumed]
In an attempt to reduce the unexplained errors, we'll add a second explanatory (independent) variable.
5
Introduction
[Scatter plot: Weight vs. Calories consumed]
• If we believe a person's height explains his/her weight too, we can add this variable to our model.
• The resulting multiple regression model is:

Weight = β0 + β1(Calories) + β2(Height) + ε
6
Introduction
• We shall use computer printout to
  – Assess the model
    • How well does it fit the data?
    • Is it useful?
    • Are any required conditions violated?
  – Employ the model
    • Interpreting the coefficients
    • Making predictions using the prediction equation
    • Estimating the expected value of the dependent variable
7
17.1 Model and Required Conditions
• We allow k independent variables to potentially explain the dependent variable:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the random error variable.
8
Model Assumptions – Required Conditions for ε
• The error ε is normally distributed.
• Its mean is equal to zero and its standard deviation σε is constant for all values of y.
• The errors are independent.
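The least squares estimation behind the coefficient printouts shown later in this chapter can be sketched with a small numeric example. This is a sketch on synthetic data (the variable names and values here are invented for illustration), not the chapter's dataset:

```python
import numpy as np

# Minimal least squares sketch: fit y = b0 + b1*x1 + b2*x2 on synthetic data
# generated from y = 2 + 0.5*x1 - 1.2*x2 + normal error.
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 0.5 * x1 - 1.2 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])  # design matrix: intercept, x1, x2
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # b holds [b0, b1, b2]
residuals = y - X @ b                      # sample estimates of the errors
```

With an intercept in the model, the residuals average to (essentially) zero, matching the zero-mean error assumption above.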
9
17.2 Estimating the Coefficients and Assessing the Model
• The procedure used to perform regression analysis:
  – Obtain the model coefficients and statistics using statistical software.
  – Diagnose violations of required conditions. Try to remedy problems when identified.
  – Assess the model fit using statistics obtained from the sample.
  – If the model assessment indicates a good fit to the data, use it to interpret the coefficients and generate predictions.
10
Estimating the Coefficients and Assessing the Model, Example
• Example 1: Where to locate a new motor inn?
  – La Quinta Motor Inns is planning an expansion.
  – Management wishes to predict which sites are likely to be profitable.
  – Several areas where predictors of profitability can be identified are:
    • Competition
    • Market awareness
    • Demand generators
    • Demographics
    • Physical quality
11
Estimating the Coefficients and Assessing the Model, Example
The dependent variable is profitability, measured by operating margin. Six candidate predictors cover competition, market awareness, customers, community, and physical quality:
  x1 = Rooms: number of hotel/motel rooms within 3 miles of the site
  x2 = Nearest: distance to the nearest La Quinta inn
  x3 = Office space
  x4 = College enrollment
  x5 = Income: median household income
  x6 = Disttwn: distance to downtown
12
Estimating the Coefficients and Assessing the Model, Example
• Data were collected from 100 randomly selected La Quinta inns, and the following suggested model was run:

Margin = β0 + β1(Rooms) + β2(Nearest) + β3(Office) + β4(College) + β5(Income) + β6(Disttwn) + ε

INN  MARGIN  ROOMS  NEAREST  OFFICE  COLLEGE  INCOME  DISTTWN
1    55.5    3203   4.2      549     8        37      2.7
2    33.8    2810   2.8      496     17.5     35      14.4
3    49.0    2890   2.4      254     20       35      2.6
4    31.9    3422   3.3      434     15.5     38      12.1
5    57.4    2687   0.9      678     15.5     42      6.9
6    49.0    3759   2.9      635     19       33      10.8
…
13
Regression Analysis, Excel Output (La Quinta workbook)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.724611
R Square            0.525062
Adjusted R Square   0.494420
Standard Error      5.512084
Observations        100

ANOVA
            df   SS        MS        F         Significance F
Regression  6    3123.832  520.6387  17.13581  3.03E-13
Residual    93   2825.626  30.38307
Total       99   5949.458

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept     38.13858      6.992948        5.453862  4.04E-07  24.25197   52.02518
Number        -0.00762      0.001255        -6.06871  2.77E-08  -0.01011   -0.00513
Nearest       1.646237      0.632837        2.601361  0.010803  0.389548   2.902926
Office Space  0.019766      0.003410        5.795594  9.24E-08  0.012993   0.026538
Enrollment    0.211783      0.133428        1.587246  0.115851  -0.05318   0.476744
Income        0.413122      0.139552        2.960337  0.003899  0.135999   0.690246
Distance      -0.22526      0.178709        -1.26048  0.210651  -0.58014   0.129622

This is the sample regression equation (sometimes called the prediction equation):

MARGIN = 38.14 - 0.0076(ROOMS) + 1.65(NEAREST) + 0.02(OFFICE) + 0.21(COLLEGE) + 0.41(INCOME) - 0.23(DISTTWN)
14
Model Assessment – Standard Error of Estimate
• A small value of σε indicates (by definition) a small variation of the errors around their mean.
• Since the mean is zero, a small variation of the errors means the errors are close to zero.
• So we would prefer a model with a small standard deviation of the error rather than a large one.
• How can we determine whether the standard deviation of the error is small or large?
15
Model Assessment – Standard Error of Estimate
• The standard deviation of the error is estimated by the standard error of estimate sε:

sε = sqrt( SSE / (n - k - 1) )

The magnitude of sε is judged by comparing it to ȳ.
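As a quick check, the standard error of estimate can be recomputed from the values in the La Quinta ANOVA table (SSE = 2825.626, n = 100, k = 6):

```python
import math

# Recompute s_eps = sqrt(SSE / (n - k - 1)) from the printout values.
SSE, n, k = 2825.626, 100, 6
s_eps = math.sqrt(SSE / (n - k - 1))
print(round(s_eps, 4))  # 5.5121, the "Standard Error" line in the output
```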
16
Standard Error of Estimate
(Same Excel regression output as shown above.)
From the printout, sε = 5.5121.
Calculating the mean value of y, we have ȳ = 45.739.
17
Model Assessment – Coefficient of Determination
• In our example it seems sε is not particularly small, or is it?
• If sε is small, the model fits the data well and is considered useful. The usefulness of the model is evaluated by the amount of variability in the y values explained by the model. This is done by the coefficient of determination.
• The coefficient of determination is calculated as

R² = SSR/SST = 1 - SSE/SST

As you can see, SSE (and thus sε) affects the value of R².
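Using the sums of squares from the La Quinta ANOVA table (SSR = 3123.832, SSE = 2825.626), the two forms of the formula give the same value:

```python
# Coefficient of determination from the ANOVA sums of squares.
SSR, SSE = 3123.832, 2825.626
SST = SSR + SSE           # 5949.458, the "Total" line
R2_a = SSR / SST          # explained-variation form
R2_b = 1 - SSE / SST      # equivalent 1 - SSE/SST form
print(round(R2_a, 6))     # 0.525062, the "R Square" line
```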
18
Coefficient of Determination
From the regression output above, R² = 0.5251; that is, 52.51% of the variability in the margin values is explained by this model.
19
Testing the Validity of the Model
• We pose the question: Is there at least one independent variable linearly related to the dependent variable?
• To answer the question we test the hypothesis

H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.

• If at least one βi is not equal to zero, the model has some validity.
20
Testing the Validity of the Model
The total variation in y (SS(Total)) can be explained in part by the regression (SSR) while the rest remains unexplained (SSE):

SS(Total) = SSR + SSE, or Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²

Note that if all the data points satisfy the linear equation without errors, yi and ŷi coincide, and thus SSE = 0. In this case all the variation in y is explained by the regression (SS(Total) = SSR).

If errors exist only in small amounts, SSR will be close to SS(Total) and the ratio SSR/SSE will be large. This leads to the F-ratio test presented next.
21
Testing for Significance
Define the mean of the sum of squares for regression (MSR) and the mean of the sum of squares for error (MSE):

MSR = SSR / k
MSE = SSE / (n - k - 1)

The ratio MSR/MSE is F-distributed:

F = MSR / MSE
22
Testing for Significance
Rejection region: F > Fα,k,n-k-1

Note: a large F results from a large SSR, which indicates that much of the variation in y is explained by the regression model; this is when the model is useful. Hence, the null hypothesis (which states that the model is not useful) should be rejected when F is sufficiently large. Therefore, the rejection region has the form F > Fα,k,n-k-1.
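The F statistic follows directly from these definitions; with the La Quinta values (SSR = 3123.832, SSE = 2825.626, k = 6, n = 100):

```python
# F = MSR / MSE computed from the ANOVA sums of squares.
SSR, SSE, n, k = 3123.832, 2825.626, 100, 6
MSR = SSR / k              # 520.6387
MSE = SSE / (n - k - 1)    # 30.38307
F = MSR / MSE
print(round(F, 4))         # 17.1358, the F column of the ANOVA table
```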
23
Testing the Model Validity of the La Quinta Inns Regression Model
The F-ratio test is performed using the ANOVA portion of the regression output (the full Excel printout appears above):

ANOVA
            df          SS              MS                            F                   Significance F
Regression  k = 6       SSR = 3123.832  MSR = SSR/k = 520.6387        MSR/MSE = 17.13581  3.03382E-13
Residual    n-k-1 = 93  SSE = 2825.626  MSE = SSE/(n-k-1) = 30.38307
Total       n-1 = 99    SST = 5949.458
24
Testing the Model Validity of the La Quinta Inns Regression Model

ANOVA
            df          SS        MS        F         Significance F
Regression  k = 6       3123.832  520.6387  17.13581  3.03382E-13
Residual    n-k-1 = 93  2825.626  30.38307
Total       n-1 = 99    5949.458

If α = .05, the critical value is Fα,k,n-k-1 = F.05,6,100-6-1 = 2.17, and F = 17.14 > 2.17.
Also, the p-value = 3.033×10⁻¹³. Clearly, p-value = 3.033×10⁻¹³ < 0.05 = α.

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the βi is not equal to zero; thus, the independent variable associated with it has a linear relationship with y. This linear regression model is useful.
25
Interpreting the Coefficients
• b0 = 38.14. This is the y-intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
• Interpreting the coefficients b1 through bk: increasing xi by one unit while holding the other variables fixed changes the predicted y by bi. For example, increasing x1 by one unit:

ŷ = b0 + b1x1 + b2x2 + … + bkxk
ŷ = b0 + b1(x1 + 1) + b2x2 + … + bkxk = b0 + b1x1 + b2x2 + … + bkxk + b1
26
Interpreting the Coefficients
• b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by 0.0076% (assuming the other variables are held constant).
27
• b2 = 1.65. In this model, for each additional mile between the nearest competitor and a La Quinta inn, the average operating margin increases by 1.65% when the other variables are held constant.
• b3 = 0.02. For each additional 1,000 sq-ft of office space, the average operating margin increases by 0.02%.
• b4 = 0.21. For each additional thousand students, the average operating margin increases by 0.21% when the other variables remain constant.
Interpreting the Coefficients
28
• b5 = 0.41. For each additional $1,000 increase in median household income, the average operating margin increases by 0.41% when the other variables remain constant.
• b6 = -0.23. For each additional mile to the downtown center, the average operating margin decreases by 0.23%.
Interpreting the Coefficients
29
Testing the Coefficients
• The hypothesis for each βi is
  H0: βi = 0
  H1: βi ≠ 0
• Test statistic:

t = (bi - βi) / s_bi,   d.f. = n - k - 1

• Excel printout:

              Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%
Intercept     38.13858      6.992948        5.453862  4.04E-07  24.25196697   52.02518
Number        -0.007618     0.00125527      -6.06871  2.77E-08  -0.010110585  -0.00513
Nearest       1.646237      0.63283691      2.601361  0.010803  0.389548431   2.902926
Office Space  0.019766      0.00341044      5.795594  9.24E-08  0.012993078   0.026538
Enrollment    0.211783      0.13342794      1.587246  0.115851  -0.053178488  0.476744
Income        0.413122      0.1395524       2.960337  0.003899  0.135998719   0.690246
Distance      -0.225258     0.17870889      -1.26048  0.210651  -0.580138524  0.129622

For example, a test for β1: t = (-0.007618 - 0)/0.001255 = -6.068. Suppose α = .01; the critical value is t.005,100-6-1 ≈ 2.63, so there is sufficient evidence to reject H0 at the 1% significance level. Moreover, the p-value of the test is 2.77×10⁻⁸. Clearly H0 is strongly rejected: the number of rooms is linearly related to the margin.
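The t statistics in the printout are simply each coefficient divided by its standard error (since the hypothesized value is 0). Reproducing the ROOMS ("Number") test:

```python
# t = (b_i - 0) / s_bi for the ROOMS ("Number") coefficient from the printout.
b1, s_b1 = -0.007618, 0.00125527
t = (b1 - 0) / s_b1
print(round(t, 4))  # -6.0688, matching the "t Stat" column
```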
30
Testing the Coefficients
• The hypothesis for each βi is
  H0: βi = 0
  H1: βi ≠ 0
  (Excel printout as shown above.)
See next the interpretation of the p-value results.
31
Interpretation
• Interpretation of the regression results for this model:
  – The number of hotel and motel rooms, the distance to the nearest motel, the amount of office space, and the median household income are linearly related to the operating margin.
  – Student enrollment and distance from downtown are not linearly related to the margin.
  – Preferable locations have only a few other motels nearby, much office space, and affluent surrounding households.
32
Using the Regression Equation
• The model can be used for making predictions by
  – producing a prediction interval estimate of a particular value of y, for given values of xi;
  – producing a confidence interval estimate for the expected value of y, for given values of xi.
• The model can be used to learn about relationships between the independent variables xi and the dependent variable y by interpreting the coefficients βi.
33
• Predict the average operating margin of an inn at a site with the following characteristics:
  – 3815 rooms within 3 miles,
  – closest competitor 0.9 miles away,
  – 476,000 sq-ft of office space,
  – 24,500 college students,
  – $35,000 median household income,
  – 11.2 miles distance to downtown center.

MARGIN = 38.14 - 0.0076(3815) + 1.65(0.9) + 0.02(476) + 0.21(24.5) + 0.41(35) - 0.23(11.2) ≈ 37.1%
La Quinta
La Quinta Inns, Predictions
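The point prediction can be reproduced from the full-precision coefficients in the printout:

```python
# Predicted margin for the site described above, using the printout coefficients
# in order: intercept, Rooms, Nearest, Office, College, Income, Disttwn.
coef = [38.13858, -0.007618, 1.646237, 0.019766, 0.211783, 0.413122, -0.225258]
x = [1, 3815, 0.9, 476, 24.5, 35, 11.2]  # leading 1 multiplies the intercept
margin = sum(b * xi for b, xi in zip(coef, x))
print(round(margin, 2))  # 37.09, matching the predicted value on the next slide
```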
34
La Quinta Inns, Predictions
• Interval estimates by Excel (Data Analysis Plus):

Prediction Interval (Margin)
  Predicted value = 37.09149
  Lower limit = 25.39527
  Upper limit = 48.78771

Interval Estimate of Expected Value
  Lower limit = 32.96972
  Upper limit = 41.21326

It is predicted that the operating margin of this site will lie between 25.4% and 48.8%, with 95% confidence.
It is expected that the average operating margin of all sites that fit this category falls between 33% and 41.2%, with 95% confidence.
The average inn of this type would not be profitable (margin less than 50%).
35
18.2 Qualitative Independent Variables
• In many real-life situations one or more independent variables are qualitative.
• Including qualitative variables in a regression analysis model is done via indicator variables.
• An indicator variable (I) can assume one out of two values, “zero” or “one”.
Examples:
  I = 1 if a first condition out of two is met; 0 if the second condition is met.
  I = 1 if data were collected before 1980; 0 if data were collected after 1980.
  I = 1 if the temperature was below 50°; 0 if the temperature was 50° or more.
  I = 1 if a degree earned is in Finance; 0 if a degree earned is not in Finance.
36
Qualitative Independent Variables; Example: Auction Car Price (II)
• Example 2 - continued
  – Recall: a car dealer wants to predict the auction price of a car.
  – The dealer now believes that both odometer reading and car color affect a car's price.
  – Three color categories are considered:
    • White
    • Silver
    • Other colors
Note: "Color" is a qualitative variable.
37
Qualitative Independent Variables; Example: Auction Car Price (II)
• Example 2 - continued
  I1 = 1 if the color is white; 0 if the color is not white
  I2 = 1 if the color is silver; 0 if the color is not silver
  The category "Other colors" is defined by: I1 = 0; I2 = 0.
38
• Note: To represent the situation of three possible colors we need only two indicator variables.
• Generally, to represent a nominal variable with m possible values, we must create m - 1 indicator variables.
How Many Indicator Variables?
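The m - 1 coding can be sketched directly; here is a toy list (the color values are illustrative, not the example's data):

```python
# Sketch: encode a three-level color variable with m - 1 = 2 indicators.
# "Other colors" is the baseline category (both indicators zero).
colors = ["white", "silver", "other", "white", "other"]
I1 = [1 if c == "white" else 0 for c in colors]
I2 = [1 if c == "silver" else 0 for c in colors]
print(I1)  # [1, 0, 0, 1, 0]
print(I2)  # [0, 1, 0, 0, 0]
```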
39
Qualitative Independent Variables; Example: Auction Car Price (II)
• Solution
  – The proposed model is

    y = β0 + β1(Odometer) + β2I1 + β3I2 + ε

  – The data (enter the data in Excel as usual):

    Price  Odometer  I-1  I-2
    14636  37388     1    0    (white)
    14122  44758     1    0    (white)
    14016  45833     0    0    (other color)
    15590  30862     0    0    (other color)
    15568  31705     0    1    (silver)
    14718  34010     0    1    (silver)
    …
40
Example: Auction Car Price (II) – The Regression Equation
From Excel we get the regression equation (price and odometer in thousands):

PRICE = 16.837 - 0.0591(Odometer) + 0.0911(I-1) + 0.3304(I-2)

The equation for a white car (I-1 = 1, I-2 = 0):
  Price = 16.837 - 0.0591(Odometer) + 0.0911(1) + 0.3304(0) = 16.928 - 0.0591(Odometer)
The equation for a silver car (I-1 = 0, I-2 = 1):
  Price = 16.837 - 0.0591(Odometer) + 0.0911(0) + 0.3304(1) = 17.167 - 0.0591(Odometer)
The equation for an "other color" car (I-1 = 0, I-2 = 0):
  Price = 16.837 - 0.0591(Odometer)

[Plot: Price vs. Odometer, showing three parallel regression lines, one per color category]
41
Example: Auction Car Price (II) – The Regression Equation
Interpreting the equation (price and odometer in thousands):

PRICE = 16.837 - 0.0591(Odometer) + 0.0911(I-1) + 0.3304(I-2)

• For one additional mile, the auction price decreases by 5.91 cents on average.
• A white car sells, on average, for $91.10 more than a car in the "other colors" category.
• A silver car sells, on average, for $330.40 more than a car in the "other colors" category.
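The color-specific lines differ only in intercept; evaluating the fitted equation (with the thousands-scaled coefficients from the Excel output) at a fixed odometer reading shows the constant color premiums:

```python
# Fitted auction-price equation; price and odometer are in thousands.
def price(odometer, i1, i2):
    return 16.837 - 0.0591 * odometer + 0.0911 * i1 + 0.3304 * i2

# Evaluate at 40 (thousand miles) for each color category.
other = price(40, 0, 0)   # baseline "other colors" car
white = price(40, 1, 0)   # baseline + 0.0911 (about $91 more)
silver = price(40, 0, 1)  # baseline + 0.3304 (about $330 more)
```

The gaps white - other and silver - other are the dummy coefficients, regardless of the odometer value.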
42
Example: Auction Car Price (II) – The Regression Equation
(Car Price-Dummy workbook)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.837135
R Square            0.700794
Adjusted R Square   0.691444
Standard Error      0.304258
Observations        100

ANOVA
            df   SS         MS        F        Significance F
Regression  3    20.814919  6.938306  74.9498  4.65E-25
Residual    96   8.8869809  0.092573
Total       99   29.7019

           Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%
Intercept  16.83725      0.1971054       85.42255   2.28E-92  16.446     17.2285
Odometer   -0.059123     0.0050653       -11.67219  4.04E-20  -0.069177  -0.049068
I-1        0.091131      0.0728916       1.250224   0.214257  -0.053558  0.235819
I-2        0.330368      0.0816498       4.046157   0.000105  0.168294   0.492442

There is insufficient evidence to infer that a white car and a car of "other color" sell for different auction prices.
There is sufficient evidence to infer that a silver car sells for a higher price than a car in the "other color" category.
43
Qualitative Independent Variables; Example: MBA Program Admission (II)
• Recall: the Dean wanted to evaluate applications for the MBA program by predicting applicants' future performance.
• The following three predictors were suggested:
  – Undergraduate GPA
  – GMAT score
  – Years of work experience
• It is now believed that the type of undergraduate degree should be included in the model.
Note: the undergraduate degree is qualitative.
44
Qualitative Independent Variables; Example: MBA Program Admission (II)
  I1 = 1 if B.A.; 0 otherwise
  I2 = 1 if B.B.A.; 0 otherwise
  I3 = 1 if B.Sc. or B.Eng.; 0 otherwise
  The category "Other group" is defined by: I1 = 0; I2 = 0; I3 = 0.
45
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.746053
R Square            0.556595
Adjusted R Square   0.524151
Standard Error      0.729328
Observations        89

ANOVA
            df   SS        MS        F         Significance F
Regression  6    54.75184  9.125307  17.15544  9.59E-13
Residual    82   43.61738  0.531919
Total       88   98.36922

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  0.189814      1.406734        0.134932  0.892996  -2.60863   2.988258
UnderGPA   -0.00606      0.113968        -0.05317  0.957728  -0.23278   0.22066
GMAT       0.012793      0.001356        9.432831  9.92E-15  0.010095   0.015491
Work       0.098182      0.030323        3.237862  0.001739  0.03786    0.158504
I-1        -0.34499      0.223728        -1.54199  0.126928  -0.79005   0.100081
I-2        0.705725      0.240529        2.934058  0.004338  0.227237   1.184213
I-3        0.034805      0.209401        0.166211  0.8684    -0.38176   0.45137
Qualitative Independent Variables; Example: MBA Program Admission (II)
MBA-II
46
Applications in Human Resources Management: Pay-Equity
• Pay-equity can be handled in two different forms:– Equal pay for equal work– Equal pay for work of equal value.
• Regression analysis is extensively employed in cases of equal pay for equal work.
47
Human Resources Management: Pay-Equity
• Example 3
  – Is there sex discrimination against female managers in a large firm?
  – A random sample of 100 managers was selected and data were collected as follows:
    • Annual salary
    • Years of education
    • Years of experience
    • Gender
48
• Solution
  – Construct the following multiple regression model:

    y = β0 + β1(Education) + β2(Experience) + β3(Gender) + ε

  – Note the nature of the variables:
    • Education: quantitative
    • Experience: quantitative
    • Gender: qualitative (Gender = 1 if male; 0 otherwise)
Human Resources Management: Pay-Equity
49
• Solution – Continued (HumanResource)
Human Resources Management: Pay-Equity
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.83256
R Square            0.693155
Adjusted R Square   0.683567
Standard Error      16273.96
Observations        100

ANOVA
            df   SS        MS        F         Significance F
Regression  3    5.74E+10  1.91E+10  72.28735  1.55E-24
Residual    96   2.54E+10  2.65E+08
Total       99   8.29E+10

            Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept   -5835.1       16082.8         -0.36282  0.71754   -37759.2   26089.02
Education   2118.898      1018.486        2.08044   0.040149  97.21837   4140.578
Experience  4099.338      317.1936        12.92377  9.89E-23  3469.714   4728.963
Gender      1850.985      3703.07         0.499851  0.618323  -5499.56   9201.527
Analysis and Interpretation
• The model fits the data quite well.
• The model is very useful.
• Experience is a variable strongly related to salary.
• There is no evidence of sex discrimination.
50
Human Resources Management: Pay-Equity
• Solution - continued (HumanResource); same Excel regression output as shown above.

Analysis and Interpretation
• Further studying the data, we find:
  – Average experience (years) for women is 12; for men it is 17.
  – Average salary for a female manager is $76,189; for a male manager it is $97,832.
51
Review problems