Welcome to Chapter 14 MBA 541

14-1

BENEDICTINE UNIVERSITY

• Regression and Correlation

• Multiple Regression and Correlation Analysis

• Chapter 14

14-2

Chapter 14

Please read Chapter 14 in Lind before viewing this presentation.

Statistical Techniques in Business & Economics

Lind

14-3

Goals

When you have completed this chapter, you will be able to:

• ONE – Describe the relationship between two or more independent variables and the dependent variable using a multiple regression equation.

• TWO – Compute and interpret the multiple standard error of estimate and the coefficient of determination.

• THREE – Interpret a correlation matrix.

14-4

Goals

When you have completed this chapter, you will be able to:

• FOUR – Set up and interpret an ANOVA table.

• FIVE – Conduct a test of hypothesis to determine if any of the set of regression coefficients differ from zero.

• SIX – Conduct a test of hypothesis on each of the regression coefficients.

14-5

Multiple Regression Analysis

• The general multiple regression with k independent variables is given by:

$Y' = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

where a is the Y-intercept, and X1 to Xk are the independent variables.

• Greek letters are used for a (α) and b (β) when denoting population parameters.

14-6

Multiple Regression Analysis

• In the general multiple regression equation, bj is the net change in Y for each unit change in Xj (holding all other values constant) where j = 1 to k.

• Each bj is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.

• The least squares criterion is used to develop the multiple regression equation.

• Because determining b1, b2, etc. is very tedious, a software package such as Excel or MINITAB is recommended.
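The least squares criterion mentioned above can be written compactly in the notation of the preceding slides. The observation index i and the sample size n are added here only for clarity:

```latex
\min_{a,\,b_1,\dots,b_k}\; \sum_{i=1}^{n} \left( Y_i - Y_i' \right)^2
\qquad \text{where} \qquad
Y_i' = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki}
```

Setting the partial derivative with respect to a and each bj to zero gives k + 1 linear equations in k + 1 unknowns, which is why software is the practical route once there are more than one or two independent variables.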

14-7

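Multiple Standard Error of Estimate

• The Multiple Standard Error of Estimate measures how closely the observed Y values cluster around the values predicted by the regression equation. The smaller it is, the better the fit.

• The formula is:

$s_{y.12 \dots k} = \sqrt{\frac{\sum (Y - Y')^2}{n - (k + 1)}}$

• It is measured in the same units as the dependent variable.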
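As a quick arithmetic check of the formula, the sketch below plugs in the error sum of squares reported in the Example 1 MINITAB output later in this chapter (slide 14-17). The function name is illustrative and does not come from any package.

```python
import math


def multiple_standard_error(sse, n, k):
    """sqrt(SSE / (n - (k + 1))), where SSE is the sum of (Y - Y')^2."""
    return math.sqrt(sse / (n - (k + 1)))


# Example 1: SSE = 2,623,764 with n = 12 families and k = 3 independent variables.
s = multiple_standard_error(2623764, n=12, k=3)
print(round(s, 1))  # compare with S = 572.7 in the MINITAB output
```

The result is in dollars per year, the same units as the dependent variable Food.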

14-8

Multiple Regression and Correlation Assumptions

• Assumptions about Multiple Regression and Correlation:

– The independent variables and the dependent variable have a linear relationship.

– The dependent variable is continuous and at least interval scale.

– The variation in (Y-Y’) must be approximately the same for all values of Y’. When this is the case, the differences exhibit homoscedasticity.

– The residuals should follow the normal distribution with a mean of 0.

– Successive values of the dependent variable must be uncorrelated.

14-9

ANOVA Table

• SSR = Regression variation. Variation accounted for by the set of independent variables.

• SSE = Error variation. Variation not accounted for by the independent variables.

• SS Total = Total Variation.

ANOVA TABLE

Source       df         SS         MS
Regression   k          SSR        SSR / k
Error        n-(k+1)    SSE        SSE / [n-(k+1)]
Total        n-1        SS Total

$\text{SSR} = \sum (Y' - \bar{Y})^2$

$\text{SSE} = \sum (Y - Y')^2$

$\text{SS Total} = \sum (Y - \bar{Y})^2$
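The table layout above can be sketched as a small helper that fills in the degrees of freedom and mean squares from SSR and SSE. The function name and dictionary layout are assumptions made for illustration; the input numbers are the ones reported in the Example 1 MINITAB output (slide 14-17).

```python
def anova_table(ssr, sse, n, k):
    """Build the ANOVA rows from the regression and error sums of squares."""
    error_df = n - (k + 1)
    return {
        "Regression": {"df": k, "SS": ssr, "MS": ssr / k},
        "Error": {"df": error_df, "SS": sse, "MS": sse / error_df},
        "Total": {"df": n - 1, "SS": ssr + sse},  # SS Total = SSR + SSE
    }


# Example 1: n = 12 families, k = 3 independent variables.
table = anova_table(ssr=10762903, sse=2623764, n=12, k=3)
for source, row in table.items():
    ms = f"{row['MS']:.0f}" if "MS" in row else ""
    print(f"{source:<12}{row['df']:>4}{row['SS']:>12}{ms:>10}")
```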

14-10

Correlation Matrix

• A Correlation Matrix is used to show the correlation coefficients between all pairs of the variables.

• The matrix is useful for locating correlated independent variables.

• It shows how strongly each independent variable is correlated with the dependent variable.

CORRELATION MATRIX

Correlation Coefficients

               Cars     Advertising   Sales Force
Cars           1.000
Advertising    0.808    1.000
Sales Force    0.872    0.537         1.000

14-11

Global Test

• The Global Test is used to investigate whether any of the independent variables have significant coefficients.

• The hypotheses are:

H0: β1 = β2 = … = βk = 0

H1: Not all βs are equal to 0

• The test statistic follows the F distribution and is computed as F = (SSR / k) / (SSE / [n-(k+1)]).

– The numerator degrees of freedom is k, the number of independent variables.

– The denominator degrees of freedom is n-(k+1), where n is the sample size.

14-12

Test for Individual Variables

• The Test of Individual Variables is used to determine which independent variables have nonzero regression coefficients.

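• Variables whose regression coefficients are not significantly different from zero are usually dropped from the analysis.

• The test statistic follows the t distribution with n-(k+1) degrees of freedom:

$t = \frac{b_j - 0}{s_{b_j}}$

where $s_{b_j}$ is the standard error of the coefficient $b_j$.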

14-13

Example 1

• A market researcher for Super Dollar Super Markets is studying the yearly amount that families of four or more spend on food.

• Three independent variables are thought to be related to yearly food expenditures (Food).

• These variables are: total family income (Income) in hundreds of dollars ($00), size of the family (Size), and whether the family has children in college (College).

14-14

Example 1 (Continued)

• Notice that the variable, College, is called a dummy or indicator variable. It can take only one of two possible outcomes. That is, a child is a college student or not.

• We usually code one value of the dummy variable as “1” and the other “0”.

• Other examples of dummy variables include gender, whether a part is acceptable or unacceptable, and whether a voter will or will not vote for the incumbent governor.

$\text{Food} = a + b_1(\text{Income}) + b_2(\text{Size}) + b_3(\text{College})$
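A minimal sketch of the 0/1 coding described above. The record layout and the field name `has_college_student` are hypothetical; the two families shown are Families 1 and 2 from the Example 1 data.

```python
def encode_college(has_college_student):
    """Code the dummy variable: 1 if the family has a child in college, else 0."""
    return 1 if has_college_student else 0


# Hypothetical raw records before coding (Income is in hundreds of dollars).
families = [
    {"income": 376, "size": 4, "has_college_student": False},
    {"income": 515, "size": 5, "has_college_student": True},
]

# Rows of (Income, Size, College) ready for the regression.
rows = [
    (f["income"], f["size"], encode_college(f["has_college_student"]))
    for f in families
]
print(rows)
```

Which category receives the 1 is arbitrary, but it determines how the coefficient is read: here b3 is the estimated difference in food spending for families with a college student relative to those without.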

14-15

Example 1 (Continued)

Family   Food   Income   Size   College
1        3900   376      4      0
2        5300   515      5      1
3        4300   516      4      0
4        4900   468      5      0
5        6400   538      6      1
6        7300   626      7      1
7        4900   543      5      0
8        5300   437      4      0
9        6100   608      5      1
10       6400   513      6      1
11       7400   493      6      1
12       5800   563      5      0

14-16

Example 1 (Continued)

• Use a computer software package, such as MINITAB or Excel, to develop a correlation matrix.

• From the analysis provided by MINITAB (as presented on the next slide), write out the regression equation:

$Y' = 954 + 1.09 X_1 + 748 X_2 + 565 X_3$

Food = $954 + $1.09 × Income + $748 × Size + $565 × College

• This regression equation can be used to estimate the food expenditures for families of different sizes, with or without college students, and with various incomes.

14-17

Example 1 (Continued)

The regression equation is
Food = 954 + 1.09 Income + 748 Size + 565 College

Predictor    Coef     SE Coef    T       P
Constant     954      1581       0.60    0.563
Income       1.092    3.153      0.35    0.738
Size         748.4    303.0      2.47    0.039
Student      564.5    495.1      1.14    0.287

S = 572.7   R-Sq = 80.4%   R-Sq(adj) = 73.1%

Analysis of Variance

Source            DF    SS          MS         F        P
Regression        3     10762903    3587634    10.94    0.003
Residual Error    8     2623764     327970
Total             11    13386667
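For readers without MINITAB, the coefficients in this output can be reproduced from the twelve families on slide 14-15. The sketch below is one way to do it with only the Python standard library, solving the least squares normal equations by Gaussian elimination. The helper names `solve` and `fit` are illustrative, and this is not a description of how MINITAB itself computes the fit.

```python
# Example 1 data: (Food, Income in $00, Size, College) for the 12 families.
DATA = [
    (3900, 376, 4, 0), (5300, 515, 5, 1), (4300, 516, 4, 0), (4900, 468, 5, 0),
    (6400, 538, 6, 1), (7300, 626, 7, 1), (4900, 543, 5, 0), (5300, 437, 4, 0),
    (6100, 608, 5, 1), (6400, 513, 6, 1), (7400, 493, 6, 1), (5800, 563, 5, 0),
]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def fit(data):
    """Least squares fit of Food on the other columns; returns [a, b1, b2, b3]."""
    y = [row[0] for row in data]
    X = [[1.0, *row[1:]] for row in data]  # leading 1 carries the intercept a
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


a, b_income, b_size, b_college = fit(DATA)
fitted = [a + b_income * inc + b_size * size + b_college * col
          for _, inc, size, col in DATA]
mean_food = sum(row[0] for row in DATA) / len(DATA)
sse = sum((row[0] - f) ** 2 for row, f in zip(DATA, fitted))
sst = sum((row[0] - mean_food) ** 2 for row in DATA)
print(f"Food = {a:.0f} + {b_income:.3f} Income + {b_size:.1f} Size + {b_college:.1f} College")
print(f"R-Sq = {1 - sse / sst:.1%}")
```

The printed coefficients and R-Sq should agree with the MINITAB output above up to rounding.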

14-18


• This regression equation indicates that, holding the other variables constant:

– Each additional $100 of income per year increases the estimated amount spent on food by $1.09 per year.

– An additional family member increases the estimated amount spent per year on food by $748.

– A family with a college student is estimated to spend $565 more per year on food than a family without a college student.

• For a family of 4, with no college students, and an income of $50,000 (which is input as 500), this regression equation predicts the food expenditure will be $4,491:

Food = $954 + $1.09 × Income + $748 × Size + $565 × College

Food = $954 + $1.09(500) + $748(4) + $565(0) = $4,491
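The prediction above is plain arithmetic, sketched here as a small function. The function name is illustrative, and the coefficients are the rounded values from the slide.

```python
def predict_food(income_hundreds, size, college):
    """Estimated yearly food expenditure in dollars (rounded slide coefficients)."""
    return 954 + 1.09 * income_hundreds + 748 * size + 565 * college


# Family of 4, no college students, income of $50,000 (entered as 500).
estimate = predict_food(income_hundreds=500, size=4, college=0)
print(round(estimate))  # 4491

# Same family with a college student: the estimate rises by b3 = $565.
with_student = predict_food(income_hundreds=500, size=4, college=1)
print(round(with_student - estimate))  # 565
```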

14-19


• Regarding the Coefficient of Determination:

• From the regression output (shown on Slide 17), we note that the coefficient of determination is 80.4%.

• This means that more than 80% of the variation in the amount spent on food is accounted for by the variables of income, family size, and student status.
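The 80.4% figure can be recovered from the ANOVA sums of squares in the same output, as the sketch below shows. It also computes the adjusted value that MINITAB reports as R-Sq(adj), using the standard adjustment for degrees of freedom; the slides quote that number without giving a formula, so the second function is an addition here.

```python
def r_squared(ssr, ss_total):
    """Coefficient of determination: share of total variation explained."""
    return ssr / ss_total


def adjusted_r_squared(sse, ss_total, n, k):
    """R-squared adjusted for the number of independent variables k."""
    return 1 - (sse / (n - (k + 1))) / (ss_total / (n - 1))


# Sums of squares from the Example 1 MINITAB output (n = 12, k = 3).
r_sq = r_squared(ssr=10762903, ss_total=13386667)
r_sq_adj = adjusted_r_squared(sse=2623764, ss_total=13386667, n=12, k=3)
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
```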

14-20


• Regarding the Correlation Matrix:

• The strongest correlation between the dependent variable and an independent variable is between family size and amount spent on food.

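• Among the independent variables, the Income correlations (0.609 with Size and 0.491 with College) fall within the usual guideline of -0.70 to +0.70. The correlation between Size and College (0.743) is slightly above +0.70, so it should be watched as a possible source of multicollinearity.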

           Food     Income   Size     College
Food       1.000
Income     0.587    1.000
Size       0.876    0.609    1.000
College    0.773    0.491    0.743    1.000
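The matrix above can be rebuilt from the Example 1 data with the usual Pearson correlation formula. This is a standard-library sketch; the helper name `pearson_r` is illustrative.

```python
import math

# Example 1 data by column (Income is in hundreds of dollars).
columns = {
    "Food": [3900, 5300, 4300, 4900, 6400, 7300, 4900, 5300, 6100, 6400, 7400, 5800],
    "Income": [376, 515, 516, 468, 538, 626, 543, 437, 608, 513, 493, 563],
    "Size": [4, 5, 4, 5, 6, 7, 5, 4, 5, 6, 6, 5],
    "College": [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0],
}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Print the lower triangle, matching the layout of the slide.
names = list(columns)
for i, row in enumerate(names):
    cells = [f"{pearson_r(columns[row], columns[col]):.3f}" for col in names[: i + 1]]
    print(f"{row:<8}", *cells)
```

The first column shows how strongly each independent variable is related to Food; the remaining off-diagonal entries are the ones to inspect for correlated independent variables.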

14-21


• Conduct a Global Test of Hypothesis to determine if any of the regression coefficients are not zero.

• The hypotheses are:

H0: β1 = β2 = β3 = 0

H1: At least one β ≠ 0

• The test statistic follows the F distribution.

– At the 0.05 significance level, with 3 and 8 degrees of freedom, H0 is rejected if F > 4.07.

– From the MINITAB output, the computed value of F is 10.94.

– Decision: H0 is rejected.

– Not all the regression coefficients are zero.
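The computed F value and the decision can be checked by hand from the ANOVA sums of squares. In the sketch below the function name is illustrative and the critical value 4.07 is taken from the slide rather than computed.

```python
def global_f_statistic(ssr, sse, n, k):
    """F = (SSR / k) / (SSE / (n - (k + 1)))."""
    return (ssr / k) / (sse / (n - (k + 1)))


# Example 1: n = 12 families, k = 3 independent variables.
f_value = global_f_statistic(ssr=10762903, sse=2623764, n=12, k=3)
CRITICAL_F = 4.07  # quoted on the slide for 3 and 8 degrees of freedom
reject_h0 = f_value > CRITICAL_F
print(f"F = {f_value:.2f}, reject H0: {reject_h0}")
```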

14-22


• Conduct an Individual Test to determine which coefficients are not zero.

• For the independent variable, family size, the hypotheses are:

H0: β2 = 0

H1: β2 ≠ 0

• Using the 5% level of significance, reject H0 if the p-value < 0.05.

• Judging by the p-values in the MINITAB output, Size (family size, p = 0.039) is the only significant variable. Income (p = 0.738) and College (listed as Student in the output, p = 0.287) are not significant and can be omitted from the model.
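The T column of the MINITAB output is each coefficient divided by its standard error, and the decision rule compares each p-value with 0.05. The sketch below copies the coefficients, standard errors, and p-values from slide 14-17; the dictionary layout is an assumption for illustration.

```python
ALPHA = 0.05

# (Coef, SE Coef, P) as reported in the MINITAB output for Example 1.
output = {
    "Income": (1.092, 3.153, 0.738),
    "Size": (748.4, 303.0, 0.039),
    "Student": (564.5, 495.1, 0.287),
}

t_values = {name: (coef - 0) / se for name, (coef, se, _) in output.items()}
significant = [name for name, (_, _, p) in output.items() if p < ALPHA]

for name, t in t_values.items():
    print(f"{name:<8} t = {t:.2f}")
print("Significant at the 5% level:", significant)
```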

14-23


• We rerun the analysis using only the significant independent variable, family size.

• The new regression equation is:

$Y' = 340 + 1031 X_2$

• As shown on the next slide, which presents the MINITAB output, the coefficient of determination is 76.8%.

• We dropped two independent variables, and the R-squared term was reduced by only 3.6 percentage points (from 80.4% to 76.8%).

14-24

Example 1 (Continued)

Regression Analysis: Food versus Size

The regression equation is
Food = 340 + 1031 Size

Predictor    Coef      SE Coef    T       P
Constant     339.7     940.7      0.36    0.726
Size         1031.0    179.4      5.75    0.000

S = 557.7   R-Sq = 76.8%   R-Sq(adj) = 74.4%

Analysis of Variance

Source            DF    SS          MS          F        P
Regression        1     10275977    10275977    33.03    0.000
Residual Error    10    3110690     311069
Total             11    13386667
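With a single independent variable the least squares coefficients have a closed form, so this output is easy to check against the Example 1 data. This is a standard-library sketch and the function name is illustrative.

```python
size = [4, 5, 4, 5, 6, 7, 5, 4, 5, 6, 6, 5]
food = [3900, 5300, 4300, 4900, 6400, 7300, 4900, 5300, 6100, 6400, 7400, 5800]


def fit_simple(x, y):
    """Return (a, b, r_squared) for the least squares line Y' = a + b X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b, sxy ** 2 / (sxx * syy)


a, b, r_sq = fit_simple(size, food)
print(f"Food = {a:.1f} + {b:.1f} Size, R-Sq = {r_sq:.1%}")
print(f"Drop from the three-variable model: {0.804 - r_sq:.1%}")
```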

14-25

Analysis of Residuals

• A Residual is the difference between the actual value of Y and the predicted value Y’.

• Residuals should be approximately normally distributed. Histograms and stem-and-leaf charts are useful in checking this requirement.

• A plot of the residuals and their corresponding Y’ values is used for showing that there are no trends or patterns in the residuals.
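The residual checks described above can be sketched for the Food versus Size model using the rounded MINITAB coefficients. The bin edges below are an assumption, chosen to match the axis labels of the histogram slide, and the text histogram is only a rough stand-in for a proper chart.

```python
size = [4, 5, 4, 5, 6, 7, 5, 4, 5, 6, 6, 5]
food = [3900, 5300, 4300, 4900, 6400, 7300, 4900, 5300, 6100, 6400, 7400, 5800]

A, B = 339.7, 1031.0  # Constant and Size coefficients from the MINITAB output

fitted = [A + B * s for s in size]                  # Y'
residuals = [y - f for y, f in zip(food, fitted)]   # Y - Y'

# Pairs to plot: residual against fitted value, looking for trends or patterns.
for f, r in sorted(zip(fitted, residuals)):
    print(f"Y' = {f:7.1f}   residual = {r:7.1f}")

# Crude histogram with assumed bin edges.
edges = [-600, -200, 200, 600, 1000]
bins = list(zip(edges, edges[1:]))
counts = [sum(lo <= r < hi for r in residuals) for lo, hi in bins]
for (lo, hi), c in zip(bins, counts):
    print(f"{lo:>5} to {hi:<5} {'*' * c}")
```

With only twelve observations the histogram can give no more than a rough impression of normality.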

14-26

Residual Plot

[Figure: Residual plot against estimated values of Y. Horizontal axis: Y’ (ticks at 4500, 6000, 7500); vertical axis: Residuals (ticks at -500, 0, 500, 1000).]

14-27

Histogram of Residuals

[Figure: Histogram of residuals. Horizontal axis: Residuals (ticks at -600, -200, 200, 600, 1000); vertical axis: Frequency (0 to 8).]
