Lecture 6: Multiple Regression Analysis
Objectives:
1. Explain and conduct multiple regression analysis in SPSS;
2. Interpret a multiple regression model; and
3. Check the assumptions and conditions of a multiple regression model.
For simple regression, the predicted value depends on only one predictor variable:
$\hat{y} = b_0 + b_1 x$
For multiple regression, we write the regression model with more predictor variables:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
The Multiple Regression Model
The variation in Bedrooms accounts for only 21% of the variation in Price.
Perhaps the inclusion of another factor can account for a portion of the remaining variation.
Simple Regression Example
Multiple Regression: Include Living Area as a predictor in the regression model.
Now the model accounts for 58% of the variation in Price.
Multiple Regression Example
NOTE: The meaning of the coefficients in multiple regression can be subtly different than in simple regression.
Price = 28986.10 – 7483.10*Bedrooms + 93.84*Living Area
Price drops with increasing bedrooms?
How can this be correct?
Multiple Regression Coefficients
In a multiple regression, each coefficient takes into account all the other predictor(s) in the model.
For houses with similar sized Living Areas:
• More bedrooms mean smaller bedrooms and/or smaller common living space.
• Cramped rooms may decrease the value of a house.
Multiple Regression Coefficients
So, what’s the correct answer to the question:
“Do more bedrooms tend to increase or decrease the price of a home?”
Correct answer:
• “Increase” if Bedrooms is the only predictor (“more bedrooms” may mean “bigger house”, after all!).
• “Decrease” if Bedrooms increases for fixed Living Area (“more bedrooms” may mean “smaller, more cramped rooms”).
Multiple Regression Coefficients
Multiple regression coefficients must be interpreted in terms of the other predictors in the model!
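The sign change described for Bedrooms can be reproduced on a small scale. The sketch below uses made-up, noise-free numbers (not the lecture's home-price data) and a hypothetical helper `fit_ols`; it only illustrates that a slope can be positive when Bedrooms is the sole predictor and negative once Living Area is also in the model.

```python
import numpy as np

# Made-up, noise-free data for illustration only (NOT the lecture's data set).
living_area = np.array([1000, 1200, 1500, 1800, 2000, 2400, 2600, 3000], dtype=float)
bedrooms = np.array([2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
# By construction: price rises with area and falls with bedrooms at fixed area.
price = 20000 + 100 * living_area - 8000 * bedrooms


def fit_ols(predictors, response):
    """Least-squares fit with an intercept; returns [b0, b1, ..., bk]."""
    design = np.column_stack([np.ones(len(response))] + list(predictors))
    coefficients, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coefficients


simple = fit_ols([bedrooms], price)                 # Bedrooms is the only predictor
multiple = fit_ols([bedrooms, living_area], price)  # Bedrooms and Living Area together

print(f"Bedrooms alone:            slope = {simple[1]:.1f}")
print(f"Bedrooms with Living Area: slope = {multiple[1]:.1f}")
```

In this toy data set bigger houses have more bedrooms, so Bedrooms alone picks up the effect of size; once Living Area is included, the Bedrooms coefficient describes houses of the same size.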
Ticket Prices
On a typical night about 15,000 people attend a Concert at Newcastle Entertainment Centre, paying an average price of more than $75 per ticket.
Data for most weeks of 2009-2011 include the variables Paid Attendance (thousands), # Shows and Average Ticket Price ($), which are used to predict Receipts ($ million).
Consider the regression model for these variables.
Dependent variable is: Receipts ($M)
R squared = 99.9%   R squared (adjusted) = 99.9%
s = 0.0931 with 74 degrees of freedom

Source      Sum of Squares  df  Mean Square  F-ratio  P-value
Regression  484.789         3   161.596      18634    < 0.0001
Residual    0.641736        74  0.008672
Example
Ticket Prices
Write the regression model for these variables.
Interpret the coefficient of Paid Attendance.
Estimate receipts when paid attendance was 200,000 customers attending 30 shows at an average ticket price of $70.
Is this likely to be a good prediction? Why or why not?
Variable              Coeff    SE(Coeff)  t-ratio  P-value
Intercept             -18.320  0.3127     -58.6    0.0001
Paid Attend           0.076    0.0006     126.7    0.0001
# Shows               0.0070   0.0044     1.6      0.116
Average Ticket Price  0.24     0.0039     61.5     0.0001
Example
Ticket Prices
Write the regression model for these variables.
$\widehat{\text{Receipts}} = -18.32 + 0.076\,(\text{Paid Attendance}) + 0.007\,(\text{\# Shows}) + 0.24\,(\text{Average Ticket Price})$
Interpret the coefficient of Paid Attendance. If the number of shows and the average ticket price are held fixed, an increase of 1,000 customers generates an average increase of $76,000 in receipts.
Estimate receipts when paid attendance was 200,000 customers attending 30 shows at an average ticket price of $70. $13.89 million
Is this likely to be a good prediction? Yes, adjusted R² is 99.9%, so this model explains most of the variability in Receipts.
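The $13.89 million estimate can be checked by substituting the values into the fitted equation. This is a minimal sketch; the function name `predict_receipts` is an illustrative choice, and the coefficients are copied from the table above (Paid Attendance in thousands, Receipts in $ million).

```python
# Coefficients from the regression table; Receipts is in $ million and
# Paid Attendance is in thousands of customers.
INTERCEPT = -18.320
B_PAID_ATTENDANCE = 0.076
B_SHOWS = 0.0070
B_AVG_TICKET_PRICE = 0.24


def predict_receipts(paid_attendance_thousands, shows, avg_ticket_price):
    """Predicted receipts in $ million from the fitted regression equation."""
    return (
        INTERCEPT
        + B_PAID_ATTENDANCE * paid_attendance_thousands
        + B_SHOWS * shows
        + B_AVG_TICKET_PRICE * avg_ticket_price
    )


# 200,000 customers is entered as 200 because attendance is in thousands.
predicted = predict_receipts(200, 30, 70)
print(f"Predicted receipts: ${predicted:.2f} million")  # -> $13.89 million
```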
Example
Linearity Assumption
Linearity Condition: Check each of the predictors.
Home Prices Example: Linearity Condition is well-satisfied for both Bedrooms and Living Area.
Assumptions and Conditions
Linearity Assumption
Linearity Condition: Also check the residual plot.
Home Prices Example: Linearity Condition is well-satisfied.
Assumptions and Conditions
Independence Assumption
As usual, there is no way to be sure the assumption is satisfied. But, think about how the data were collected to decide if the assumption is reasonable.
Randomization Condition: Does the data collection method introduce any bias?
Assumptions and Conditions
Equal Variance Assumption
Equal Spread Condition: The variability of the errors should be about the same for each predictor.
Use scatterplots to assess the Equal Spread Condition.
[Figure: Residuals vs. Predicted Values, Home Prices]
Assumptions and Conditions
Normality Assumption
Nearly Normal Condition: Check to see if the distribution of residuals is unimodal and symmetric.
Home Price Example: The “tails” of the distribution appear to be non-normal.
Assumptions and Conditions
Summary of Multiple Regression Model and Condition Checks:
1. Check Linearity Condition with a scatterplot for each predictor. If necessary, consider data re-expression.
2. If the Linearity Condition is satisfied, fit a multiple regression model to the data.
3. Find the residuals and predicted values.
4. Inspect a scatterplot of the residuals against the predicted values. Check for nonlinearity and non-uniform variation.
Assumptions and Conditions
Summary of Multiple Regression Model and Condition Checks:
5. Think about how the data were collected.
Do you expect the data to be independent?
Was suitable randomization utilized?
Are the data representative of a clearly identifiable population?
Is autocorrelation an issue?
Assumptions and Conditions
Summary of Multiple Regression Model and Condition Checks:
6. If the conditions check, feel free to interpret the regression model and use it for prediction.
7. Check the Nearly Normal Condition by inspecting a histogram of the residuals and a Normal probability plot. If the sample size is large, the Nearly Normal Condition is less important for inference. Watch for skewness and outliers.
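The fitting and residual steps in this summary can be sketched outside SPSS. The example below is only an illustration under stated assumptions: the data are synthetic (generated with a fixed seed), and the two numeric summaries (`spread_ratio`, `skewness`) are rough stand-ins for the plots the lecture recommends, not replacements for them.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data for illustration only: two predictors and a linear response.
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, n)

# Step 2: fit the multiple regression by least squares.
design = np.column_stack([np.ones(n), x1, x2])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

# Step 3: predicted values and residuals.
predicted = design @ coefficients
residuals = y - predicted

# Step 4 (numeric stand-in for the residuals vs. predicted plot):
# compare the residual spread for low and high predicted values.
order = np.argsort(predicted)
low_half = residuals[order[: n // 2]]
high_half = residuals[order[n // 2:]]
spread_ratio = high_half.std(ddof=1) / low_half.std(ddof=1)

# Step 7 (numeric stand-in for the histogram): sample skewness of the residuals.
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
skewness = np.mean(standardized ** 3)

print("coefficients:", np.round(coefficients, 3))
print("spread ratio (high/low):", round(spread_ratio, 2))
print("residual skewness:", round(skewness, 2))
```

A spread ratio far from 1 or a skewness far from 0 would be a prompt to look at the plots more closely.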
Assumptions and Conditions
There are several hypothesis tests in multiple regression.
Each is concerned with whether the underlying parameters (slopes and intercept) are actually zero.
The hypothesis for slope coefficients:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_A: \text{at least one } \beta_j \neq 0$
Test the hypothesis with an F-test (a generalization of the t-test to more than one predictor).
Testing the Model
The F-distribution has two degrees of freedom:
k, where k is the number of predictors
n – k – 1 , where n is the number of observations
The F-test is one-sided – bigger F-values mean smaller P-values.
If the null hypothesis is true, then F will be near 1.
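As a worked check, the F-ratio can be recomputed from the sums of squares in the ANOVA table of the ticket-price example. This is a minimal sketch; the variable names are illustrative, and the P-value itself is left to software.

```python
# Values from the ANOVA table of the ticket-price regression.
ss_regression = 484.789
ss_residual = 0.641736
k = 3              # number of predictors (regression df)
df_residual = 74   # n - k - 1 (residual df)
n = df_residual + k + 1

# Each mean square is a sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / k
ms_residual = ss_residual / df_residual

# The F-ratio compares the two mean squares.
f_ratio = ms_regression / ms_residual

print(f"n = {n}, df = ({k}, {df_residual})")
print(f"MS regression = {ms_regression:.3f}, MS residual = {ms_residual:.6f}")
print(f"F = {f_ratio:.0f}")
```

An F-ratio this far above 1 is why the table reports a P-value below 0.0001.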
Testing the Model
If a multiple regression F-test leads to a rejection of the null hypothesis, then check the t-test statistic for each coefficient:
$t_{n-k-1} = \dfrac{b_j - 0}{SE(b_j)}$
Note that the degrees of freedom for the t-test is n – k – 1.
Confidence interval: $b_j \pm t^{*}_{n-k-1} \times SE(b_j)$
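The t-ratios in the ticket-price coefficient table follow directly from this formula. The sketch below recomputes them from the reported coefficients and standard errors. The critical value `T_STAR = 1.99` is an assumption: an approximate 95% value for 74 degrees of freedom typed in by hand, which in practice would come from software or a t-table.

```python
# (coefficient, standard error) pairs from the ticket-price coefficient table.
coefficients = {
    "Intercept": (-18.320, 0.3127),
    "Paid Attend": (0.076, 0.0006),
    "# Shows": (0.0070, 0.0044),
    "Average Ticket Price": (0.24, 0.0039),
}

# t-ratio for H0: beta_j = 0 is (b_j - 0) / SE(b_j).
t_ratios = {name: (b - 0) / se for name, (b, se) in coefficients.items()}

# ASSUMPTION: approximate 95% critical value for 74 df, supplied by hand.
T_STAR = 1.99


def confidence_interval(b, se, t_star=T_STAR):
    """Interval b_j +/- t* x SE(b_j)."""
    return (b - t_star * se, b + t_star * se)


for name, (b, se) in coefficients.items():
    low, high = confidence_interval(b, se)
    print(f"{name:22s} t = {t_ratios[name]:7.1f}   CI = ({low:.4f}, {high:.4f})")
```

The interval for # Shows contains zero, which agrees with its non-significant P-value of 0.116.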
Testing the Model
“Tricky” Parts of the t-tests:
SE’s are harder to compute (let technology do it!)
The meaning of a coefficient depends on the other predictors in the model (as we saw in the Home Price example).
If we fail to reject $H_0: \beta_j = 0$ based on its t-test, it does not mean that $x_j$ has no linear relationship to y. Rather, it means that $x_j$ contributes nothing to modeling y after allowing for the other predictors.
Testing the Model
In Multiple Regression, it looks like each $b_j$ tells us the effect of its associated predictor, $x_j$.
BUT
The coefficient $b_j$ can be different from zero even when there is no correlation between y and $x_j$.
It is even possible that the multiple regression slope changes sign when a new variable enters the regression.
Testing the Model
More Ticket Prices
On a typical night about 15,000 people attend a Concert at Newcastle Entertainment Centre, paying an average price of more than $75 per ticket.
Data for most weeks of 2009-2011 include the variables Paid Attendance (thousands), # Shows and Average Ticket Price ($), which are used to predict Receipts ($ million).
State the hypotheses, the test statistic and p-value, and draw a conclusion for an F-test for the overall model.
Dependent variable is: Receipts ($M)
R squared = 99.9%   R squared (adjusted) = 99.9%
s = 0.0931 with 74 degrees of freedom

Source      Sum of Squares  df  Mean Square  F-ratio  P-value
Regression  484.789         3   161.596      18634    < 0.0001
Residual    0.641736        74  0.008672
Example
More Ticket Prices
State the hypotheses for an F-test for the overall model.
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_A: \beta_1 \neq 0,\ \beta_2 \neq 0,\ \text{or } \beta_3 \neq 0$
State the test statistic and p-value. The F-statistic is the F-ratio = 18634. The p-value is < 0.0001.
Draw a conclusion. The p-value is small, so reject the null hypothesis. At least one of the predictors accounts for enough variation in y to be useful.
Example
More Ticket Prices
Since the F-ratio suggests that at least one variable is a useful predictor, determine which of the following variables contribute in the presence of the others. Recall the variables Paid Attendance (thousands), # shows, Average Ticket Price ($) to predict Receipts($million).
Variable              Coeff    SE(Coeff)  t-ratio  P-value
Intercept             -18.320  0.3127     -58.6    0.0001
Paid Attend           0.076    0.0006     126.7    0.0001
# Shows               0.0070   0.0044     1.6      0.116
Average Ticket Price  0.24     0.0039     61.5     0.0001
Example
More Ticket Prices
Since the F-ratio suggests that at least one variable is a useful predictor, determine which of the following variables contribute in the presence of the others.
Paid Attendance (p = 0.0001) and Average Ticket Price (p = 0.0001) both contribute, even when all other variables are in the model. # Shows, however, is not significant (p = 0.116) and should be removed from the model.
Example
R² in Multiple Regression:
R² = fraction of the total variation in y accounted for by the model (all the predictor variables included)
Adding new predictor variables to a model never decreases R² and may increase it.
But each added variable increases the model complexity, which may not be desirable.
Adjusted R² imposes a “penalty” on the correlation strength of larger models, depreciating their R² values to account for an undesired increase in complexity.
Example
Adjusted R² permits a more equitable comparison between models of different sizes.
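Both quantities can be recovered from the ANOVA table of the ticket-price example. The sketch below uses the standard adjustment, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1); the variable names are illustrative, and n = 78 is inferred from the 74 residual degrees of freedom with k = 3 predictors.

```python
# Sums of squares from the ticket-price ANOVA table.
ss_regression = 484.789
ss_residual = 0.641736
k = 3    # number of predictors
n = 78   # inferred: n - k - 1 = 74 residual degrees of freedom

# R squared: share of the total variation accounted for by the model.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

# Adjusted R squared: penalises the model for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R squared          = {100 * r_squared:.1f}%")
print(f"Adjusted R squared = {100 * adj_r_squared:.1f}%")
```

With 78 observations and only three predictors the penalty is tiny, so both values round to the 99.9% shown in the output.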
Multiple Regression in SPSS: Words
• Analyze
• Regression
• Linear
• Select the “Dependent Variable” - use the > button to move into the Dependent: box
• Select the “Independent Variables” - use the > button to move into the Independent(s): box
• Click Statistics
• Select Descriptives
MULTIPLE REGRESSION IN SPSS: VISUALS
• Select Variables
• Use the > button to move the variable into the Dependent: box
• Use the > button to move variables into the Independent(s): box
• Click Statistics
• Select Descriptives
This tells us that 99.9% of the variation in Receipts can be explained by our linear regression model
Note: R Square is the coefficient of multiple determination. It shows the strength of the association between the Dependent Variable (Y) and two or more Independent Variables (X’s) (from 0 to 1, usually reported as a percentage).
R Square adjusted for the number of Independent variables and the sample size
Is the relationship Significant?
That is, is it strong enough to indicate there is also a relationship in the population?
P value = 0.000 < 0.05
Therefore, the relationship is significant
MULTIPLE REGRESSION IN SPSS: OUTPUT
Multiple Regression
Output
Partial Regression Coefficients
These can be used to construct the regression equation for Receipts.
Receipts = a + b1X1 + b2X2 + b3X3 + … + bkXk
Receipts = -18.320 + 0.076*PaidAttendance + 0.007*Shows + 0.238*AvgTicketPrice
If we know the values for the three predictors we can use the regression equation to predict the Receipts value.
Multiple Regression
Output
Testing the significance of the Regression Coefficients
The t and Sig t values given in the Coefficients table tell us which partial regression coefficients (slopes) differ significantly from zero.
In this example the variables that contribute significantly (p-values < 0.05) are:
PaidAttendance: t = 120.751, p = 0.000
AvgTicketPrice: t = 61.014, p = 0.000
The regression equation can therefore be rewritten as:
Receipts = -18.320 + 0.076*PaidAttendance + 0.238*AvgTicketPrice
MODEL: WHICH PREDICTORS ARE SIGNIFICANT?
Multiple Regression
Output
Partial Regression Coefficients – INTERPRETATION
The partial regression coefficient for “AvgTicketPrice” might be interpreted:
If PaidAttendance is statistically controlled, an increase of $1 in AvgTicketPrice will INCREASE the predicted Receipts value by 0.238 ($ million), i.e. $238,000.
Multiple Regression
Output
Standardised Regression Coefficients
• Useful in assessing the relative importance of the predictors and comparing predictors across samples.
• Are the coefficients obtained when all variables are standardised (mean 0, S.D. 1), so that the y-intercept (constant) is zero.
The most important predictor in this model is “PaidAttendance” (Beta = 0.955).
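For reference, a standardised coefficient is linked to its unstandardised counterpart by the standard deviations of the predictor and the response. The symbols $s_{x_j}$ and $s_y$ are notation introduced here for those sample standard deviations; they do not appear in the SPSS output shown.

```latex
\mathrm{Beta}_j = b_j \cdot \frac{s_{x_j}}{s_y}
```

Because the units cancel, Beta values for predictors measured on different scales (thousands of customers, dollars per ticket) can be compared directly.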
Multiple Regression: Interpretation
Multiple regression analysis was undertaken to determine the factors that contribute to Receipts ($ million) of the Newcastle Entertainment Centre. Results indicated that the number of paid attendees (t = 120.751, p = 0.000) and the average ticket price (t = 61.014, p = 0.000) are significant predictors of this value. The most important predictor in the model was “PaidAttendance” (Beta = 0.955).
The regression model with the significant predictors is:
Receipts = -18.320 + 0.076*PaidAttendance + 0.238*AvgTicketPrice
Multiple Regression: Interpretation
Receipts = -18.320 + 0.076*PaidAttendance + 0.238*AvgTicketPrice
If the average ticket price is statistically controlled (or fixed), an increase of 1,000 paying customers will increase predicted Receipts by $76,000 on average.
If the number of paid attendees is statistically controlled, an increase of $1 in the average ticket price will generate an average increase in Receipts of $238,000 (0.238 in units of $ million).
This regression model explains 99.9% of the variation in the receipts generated for Newcastle Entertainment Centre. Therefore this is a good model, as nearly all of the variability in Receipts is explained by it.
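Because Receipts is recorded in $ million and PaidAttendance in thousands, each coefficient has to be converted before it is read in dollars. A minimal sketch of that unit conversion, with illustrative variable names:

```python
DOLLARS_PER_MILLION = 1_000_000

# Coefficients from the fitted model; Receipts is measured in $ million.
b_paid_attendance = 0.076    # $ million per additional 1,000 paying customers
b_avg_ticket_price = 0.238   # $ million per additional $1 of average ticket price

# Convert each one-unit effect from $ million into dollars.
increase_per_1000_customers = b_paid_attendance * DOLLARS_PER_MILLION
increase_per_dollar_of_price = b_avg_ticket_price * DOLLARS_PER_MILLION

print(f"Per 1,000 extra customers: ${increase_per_1000_customers:,.0f}")
print(f"Per $1 of ticket price:    ${increase_per_dollar_of_price:,.0f}")
```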
Don’t claim to “hold everything else constant” for a single individual. (For the predictors Age and Years of Education, it is impossible for an individual to get a year of education at constant age.)
Don’t interpret regression causally. Statistics assesses correlation, not causality.
Be cautious about interpreting a regression as predictive. That is, be alert for combinations of predictor values that take you outside the ranges of these predictors.
Be careful when interpreting the signs of coefficients in a multiple regression. The sign of a coefficient can change depending on which other predictors are in or out of the model. The truth is more subtle and requires that we understand the multiple regression model.
If a coefficient’s t-statistic is not significant, don’t interpret it at all.
Don’t fit a linear regression to data that aren’t straight. Usually, we are satisfied when plots of y against the x’s are straight enough.
Watch out for changing variance in the residuals. The most common check is a plot of the residuals against the predicted values.
Make sure the errors are nearly normal.
Watch out for high-influence points and outliers.
Review: Fundamentals of Quantitative Analysis
Lecture Plan
Week 1: The Role, Collection and Presentation of Quantitative Data in the Business Decision Making Process
Week 2: Examining Data Characteristics: Descriptive Statistics and Data Screening
Week 3: Estimation and Hypothesis Testing
Week 4: Testing for Differences: One sample, Independent and Paired Sample t-tests and ANOVA
Week 5: Testing for Associations: Chi Square, Correlation and Simple Regression Analysis
Week 6: Multiple Regression Analysis
Lecture 1
Objectives:
1. Explain the use of quantitative techniques in business;
2. Discuss the role of quantitative data analysis in the Business Decision Making Process;
3. Explain different sources of quantitative data and how it is collected;
4. Define and describe different types of data;
5. Recognise the potential for using different methods of data presentation in business;
6. Outline the major alternative methods of data presentation;
7. Select between the major alternative methods; and
8. Describe the limitations of data presentation methods.
Lecture 2
Objectives:
1. Describe and display categorical data;
2. Generate and interpret frequency tables, bar charts and pie charts;
3. Generate and interpret histograms to display the distribution of a quantitative variable;
4. Describe the shape, centre and spread of a distribution;
5. Compute descriptive statistics and select between mean/median and standard deviation / interquartile range; and
6. Explain data screening and its purpose, and be able to assess a distribution for normality.
Lecture 3
Objectives:
1. Formulate a null and alternate hypothesis for a question of interest;
2. Explain what a test statistic is;
3. Explain p-values;
4. Describe the reasoning of hypothesis testing;
5. Determine and check assumptions for the sampling distribution model;
6. Compare p-values to a pre-determined significance level to decide whether to reject the null hypothesis;
7. Recognise the value of estimating and reporting the effect size; and
8. Explain Type I and Type II errors when testing hypotheses.
Lecture 4
Objectives:
1. Recognise when to use a one sample t-test, independent samples t-test, paired samples t-test and ANOVA;
2. Explain and check the assumptions and conditions for each test;
3. Run and interpret a 'One sample t-test' to show a sample mean is different from some hypothesised value;
4. Run and interpret an ‘independent samples t-test’ to show the difference between two groups on one attribute;
5. Run and interpret a 'one-way ANOVA' to show the difference between more than two groups on one attribute; and
6. Run and interpret a ‘paired samples t-test’ to show the difference between two attributes as assessed by one sample.
Lecture 5
Objectives:
1. Recognise when a chi-square test of independence is appropriate;
2. Check the assumptions and corresponding conditions for a chi-square test of independence;
3. Run and interpret a chi-square test of independence;
4. Produce and explain a scatter plot to display the relationship between two quantitative variables;
5. Interpret the association between two quantitative variables using a Pearson's correlation coefficient;
6. Model a linear relationship with a least squares regression model;
7. Explain and check the assumptions and conditions for inference about regression models; and
8. Examine the residuals from a linear model to assess the quality of the model.
Lecture 6
Objectives:
1. Explain and conduct multiple regression analysis in SPSS;
2. Interpret a multiple regression model; and
3. Check the assumptions and conditions of a multiple regression model.
Quantitative vs Qualitative Data
• http://www.youtube.com/watch?v=ddx9PshVWXI&feature=related
End of Quantitative Model