View
230
Download
1
Embed Size (px)
Citation preview
Chapter Topics• Types of Regression Models
• Determining the Simple Linear Regression Equation
• Measures of Variation in Regression and Correlation
• Assumptions of Regression and Correlation
• Residual Analysis and the Durbin-Watson Statistic
• Estimation of Predicted Values
• Correlation - Measuring the Strength of the Association
Purpose of Regression and Correlation Analysis
• Regression Analysis is Used Primarily for Prediction
A statistical model used to predict the values of a dependent or response variable based on values of at least one independent or explanatory variable
Correlation Analysis is Used to Measure Strength of the Association Between Numerical Variables
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Simple Linear Regression Model
iii XY 10
Y intercept
Slope
• The Straight Line that Best Fit the Data
• Relationship Between Variables Is a Linear Function
Random Error
Dependent (Response) Variable
Independent (Explanatory) Variable
Sample Linear Regression Model
ii XbbY 10
Yi
= Predicted Value of Y for observation i
Xi = Value of X for observation i
b0 = Sample Y - intercept used as estimate ofthe population 0
b1 = Sample Slope used as estimate of the population 1
Simple Linear Regression Equation: Example
You wish to examine the relationship between the square footage of produce stores and its annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best
Annual Store Square Sales
Feet ($000)
1 1,726 3,681
2 1,542 3,395
3 2,816 6,653
4 5,555 9,543
5 1,292 3,318
6 2,208 5,563
7 1,313 3,760
Equation for the Best Straight Line
i
ii
X..
XbbY
4871415163610
From Excel Printout:
CoefficientsIntercept 1636.414726X Variable 1 1.486633657
Graph of the Best Straight Line
0
2000
4000
6000
8000
10000
12000
0 1000 2000 3000 4000 5000 6000
Square Feet
An
nu
al
Sa
les
($00
0)
Y i = 1636.415 +1.487X i
Interpreting the Results
Yi = 1636.415 +1.487Xi
The slope of 1.487 means for each increase of one unit in X, the Y is estimated to increase 1.487units.
For each increase of 1 square foot in the size of the store, the model predicts that the expected annual sales are estimated to increase by $1487.
Measures of Variation:The Sum of Squares
SST = Total Sum of Squares
•measures the variation of the Yi values around their mean Y
SSR = Regression Sum of Squares
•explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
•variation attributable to factors other than the relationship between X and Y
_
Measures of Variation: The Sum of Squares
Xi
Y i = b0
+ b1X i
Y
X
Y
SST = (Yi - Y)2
SSE =(Yi - Yi )2
SSR = (Yi - Y)2
_
_
_
df SSRegression 1 30380456.12Residual 5 1871199.595Total 6 32251655.71
The Sum of Squares: Example
Excel Output for Produce Stores
SSR SSE SST
The Coefficient of Determination
SSR regression sum of squares
SST total sum of squaresr2 = =
Measures the proportion of variation that is explained by the independent variable X in the regression model
Coefficients of Determination (r2) and Correlation (r)
r2 = 1, r2 = 1,
r2 = .8, r2 = 0,Y
Yi = b0 + b1Xi
X
^
YYi = b0 + b1Xi
X
^Y
Yi = b0 + b1Xi
X
^
Y
Yi = b0 + b1Xi
X
^
r = +1 r = -1
r = +0.9 r = 0
Standard Error of Estimate
2
n
SSESyx
21
2
n
)YY(n
iii
=
The standard deviation of the variation of observations around the regression line
Regression StatisticsMultiple R 0.9705572R Square 0.94198129Adjusted R Square 0.93037754Standard Error 611.751517Observations 7
Measures of Variation: Example
Excel Output for Produce Stores
r2 = .94 Syx94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage
Linear Regression Assumptions
• 1. Normality– Y Values Are Normally Distributed For Each X– Probability Distribution of Error is Normal
• 2. Homoscedasticity (Constant Variance)
• 3. Independence of Errors
For Linear Models
Variation of Errors Around the Regression Line
X1
X2
X
Y
f(e)y values are normally distributed
around the regression line.
For each x value, the “spread” or variance around the regression
line is the same.
Regression Line
Residual Analysis• Purposes
– Examine Linearity – Evaluate violations of assumptions
• Graphical Analysis of Residuals– Plot residuals Vs. Xi values
• Difference between actual Yi & predicted Yi– Studentized residuals:
• Allows consideration for the magnitude of the residuals
Residual Analysis for Homoscedasticity
Heteroscedasticity Homoscedasticity
Using Standardized Residuals
SR
X
SR
X
The Durbin-Watson Statistic•Used when data is collected over time to detect
autocorrelation (Residuals in one time period are related to residuals in another period)
•Measures Violation of independence assumption
n
ii
n
iii
e
)ee(D
1
2
2
21 Should be close to 2.
If not, examine the model for autocorrelation.
Inferences about the Slope: t Test• t Test for a Population Slope
Is a Linear Relationship Between X & Y ?
1
11
bS
bt
•Test Statistic:
n
ii
YXb
)XX(
SS
1
21
and df = n - 2
•Null and Alternative Hypotheses
H0: 1 = 0 (No Linear Relationship) H1: 1 0 (Linear Relationship)
Where
Example: Produce Stores
Data for 7 Stores: Regression Model Obtained:
The slope of this model is 1.487.
Is there a linear relationship between the square footage of a store and its annual sales?
Annual
Store Square Sales Feet ($000)
1 1,726 3,681
2 1,542 3,395
3 2,816 6,653
4 5,555 9,543
5 1,292 3,318
6 2,208 5,563
7 1,313 3,760
Yi = 1636.415 +1.487Xi
t Stat P-valueIntercept 3.6244333 0.0151488X Variable 1 9.009944 0.0002812
• H0: 1 = 0
• H1: 1 0
.05•df 7 - 2 = 7•Critical Value(s):
Test Statistic:
Decision:
Conclusion:
There is evidence of a relationship.t
0 2.5706-2.5706
.025
Reject Reject
.025
From Excel Printout
Reject H0
Inferences about the Slope: t Test
Inferences about the Slope: Confidence Interval Example
Confidence Interval Estimate of the Slopeb1 tn-2 1bS
Excel Printout for Produce Stores
At 95% level of Confidence The confidence Interval for the slope is (1.062, 1.911). Does not include 0.
Conclusion: There is a significant linear relationship between annual sales and the size of the store.
Lower 95% Upper 95%Intercept 475.810926 2797.01853X Variable 11.06249037 1.91077694
Estimation of Predicted Values
Confidence Interval Estimate for XY
The Mean of Y given a particular Xi
n
ii
iyxni
)XX(
)XX(
nStY
1
2
2
21
t value from table with df=n-2
Standard error of the estimate
Size of interval vary according to distance away from mean, X.
Estimation of Predicted Values
Confidence Interval Estimate for Individual Response Yi at a Particular Xi
n
ii
iyxni
)XX(
)XX(
nStY
1
2
2
21
1
Addition of this 1 increased width of interval from that for the mean Y
Interval Estimates for Different Values of X
X
Y
X
Confidence Interval for a individual Yi
A Given X
Confidence Interval for the mean of Y
Y i = b0 + b1X i
_
Example: Produce Stores
Yi = 1636.415 +1.487Xi
Data for 7 Stores:
Regression Model Obtained:
Predict the annual sales for a store with
2000 square feet.
Annual Store Square Sales
Feet ($000)
1 1,726 3,681
2 1,542 3,395
3 2,816 6,653
4 5,555 9,543
5 1,292 3,318
6 2,208 5,563
7 1,313 3,760
Estimation of Predicted Values: Example
Confidence Interval Estimate for Individual Y
Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet
n
ii
iyxni
)XX(
)XX(
nStY
1
2
2
21
Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 ($000)
X = 2350.29 SYX = 611.75 tn-2 = t5 = 2.5706
= 4610.45 980.97
Confidence interval for mean Y
Estimation of Predicted Values: Example
Confidence Interval Estimate for XY
Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet
Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 ($000)
X = 2350.29 SYX = 611.75 tn-2 = t5 = 2.5706
= 4610.45 1853.45
Confidence interval for individual Y
n
ii
iyxni
)XX(
)XX(
nStY
1
2
2
21
1
Correlation: Measuring the Strength of Association
• Answer ‘How Strong Is the Linear Relationship Between 2 Variables?’
• Coefficient of Correlation Used– Population correlation coefficient denoted
(‘Rho’)– Values range from -1 to +1– Measures degree of association
• Is the Square Root of the Coefficient of Determination