Simple Regression Model
INSR 260, Spring 2009
Bob Stine

Overview
Simple Regression Model (SRM)
  Estimators, terminology
  Assumptions
  Inference
  Prediction
Examples (both from Stat 102)
  Promotion response
  Diamond values

Example: Pricing Diamonds
Questions
  How much should I expect to pay for a diamond?
  How accurate is such an estimate?
  How do prices depend on the weight of the diamond?
Data
  n = 180 diamonds, relatively small (less than 0.4 carats)
[Figure: Bivariate Fit of Price By Carat. Price (300 to 1100) against Carat (0.3 to 0.4).]

Direct Approach
Average the price of diamonds of each weight, then connect the dots with lines.
Produces a “smooth” curve and an estimate of accuracy, but does not answer the question of how price and weight are related.
[Figure: Price (300 to 1100) against Carat (0.3 to 0.4).]
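
The direct approach can be sketched in a few lines of Python. This is only an illustration: the (carat, price) pairs are invented, and `average_by_weight` is a hypothetical helper name, not anything from the course data or JMP.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (carat, price) pairs, invented for illustration only.
diamonds = [
    (0.30, 510), (0.30, 545), (0.30, 560),
    (0.32, 580), (0.32, 610),
    (0.35, 640), (0.35, 700), (0.35, 655),
]

def average_by_weight(pairs):
    """Group prices by weight; return {weight: (mean, sd or None, count)}."""
    groups = defaultdict(list)
    for carat, price in pairs:
        groups[carat].append(price)
    summary = {}
    for carat in sorted(groups):
        prices = groups[carat]
        # The SD within a weight gives a rough sense of accuracy at that weight.
        sd = stdev(prices) if len(prices) > 1 else None
        summary[carat] = (mean(prices), sd, len(prices))
    return summary

summary = average_by_weight(diamonds)
for carat, (avg, sd, n) in summary.items():
    print(carat, avg, sd, n)
```

Each weight gets its own average, so nothing ties the averages together; that is the gap the regression model fills.
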
Simple Regression Model
Association versus causation
Equation relates observed x and y (n cases)
  Conditional means: $E(Y \mid X) = \beta_0 + \beta_1 X = \mu_{y|x}$
  Observations: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Assumptions
  Independent observations
  Equal variance $\sigma^2$
  Normal distribution around line: $y_i \sim N(\mu_{y|x}, \sigma^2)$, $\varepsilon_i \sim N(0, \sigma^2)$
Three parameters: $\beta_0$, $\beta_1$, $\sigma^2$
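
One way to see what the model assumes is to simulate from it. The sketch below draws independent normal errors with equal variance around a line; the parameter values and the function name `simulate_srm` are hypothetical, chosen only for illustration.

```python
import random

def simulate_srm(xs, beta0, beta1, sigma, seed=0):
    """Draw y_i = beta0 + beta1 * x_i + eps_i with independent eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, sigma) for _ in xs]
    ys = [beta0 + beta1 * x + e for x, e in zip(xs, errors)]
    return ys, errors

# Hypothetical parameter values, not estimates from the diamond data.
BETA0, BETA1, SIGMA = -160.0, 2300.0, 145.0

x_rng = random.Random(1)
xs = [x_rng.uniform(0.30, 0.40) for _ in range(10_000)]
ys, errors = simulate_srm(xs, BETA0, BETA1, SIGMA)

# The errors should average close to zero relative to SIGMA.
mean_error = sum(errors) / len(errors)
print(round(mean_error, 2))
```
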
Least Squares
Criterion
  Find estimates that minimize sum of squared deviations
  $\min_{a_0, a_1} \sum_i (y_i - a_0 - a_1 x_i)^2$
Solution
  $b_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)$
  $b_0 = \bar{y} - b_1 \bar{x}$
  where $\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ and $\mathrm{var}(x) = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$
Fitted values, residuals
  Fitted values (on the line): $\hat{y} = b_0 + b_1 x$
  Residual deviations: $e = y - \hat{y}$
Standard error of regression
  $s^2 = \sum_i e_i^2 / (n - 2)$ estimates $\sigma^2$, with $n - 2$ degrees of freedom
  $s$ is aka root mean squared error (RMSE), SD of residuals
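
The least squares formulas translate directly into code. The sketch below uses a tiny made-up data set (not the diamonds) and a hypothetical helper `fit_line`; it follows the cov/var form of the slope and divides the residual sum of squares by n - 2.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least squares line: returns (b0, b1, residuals, s) with s the RMSE."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x            # slope = cov(x, y) / var(x)
    b0 = ybar - b1 * xbar          # line passes through (xbar, ybar)
    fitted = [b0 + b1 * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    s2 = sum(e * e for e in residuals) / (n - 2)   # n - 2 degrees of freedom
    return b0, b1, residuals, sqrt(s2)

# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, residuals, s = fit_line(xs, ys)
print(round(b0, 3), round(b1, 3), round(s, 3))
```

The residuals from a least squares fit with an intercept sum to zero, which makes a quick sanity check on any implementation.
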
Goodness of Fit
R-squared statistic
  Square of correlation between Y and X
  Percentage of “explained” variation
  $R^2 = \frac{\text{Explained SS}}{\text{Total SS}} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2}$
Adjusted R-squared
  Takes account of number of observations
  $\bar{R}^2 = 1 - \frac{s^2}{\mathrm{var}(y)}$
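
The two statistics are linked: since $s^2$ divides by n - 2 and var(y) by n - 1, the adjusted version equals 1 - (1 - R^2)(n - 1)/(n - 2) when there is one predictor. The sketch below checks this against the RSquare (0.270671) and n (180) that JMP reports for the small-diamond fit; the function names are hypothetical.

```python
def r_squared(ys, fitted):
    """1 - residual SS / total SS."""
    ybar = sum(ys) / len(ys)
    total_ss = sum((y - ybar) ** 2 for y in ys)
    resid_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1 - resid_ss / total_ss

def adjusted_r_squared(r2, n):
    """1 - s^2 / var(y), rewritten as 1 - (1 - R^2)(n - 1)/(n - 2) for one predictor."""
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# RSquare and n as printed in the JMP Summary of Fit for the 180 small diamonds.
adj = adjusted_r_squared(0.270671, 180)
print(round(adj, 5))  # JMP prints RSquare Adj = 0.266573

# A perfect fit has R-squared equal to 1.
perfect = r_squared([1.0, 2.0, 4.0], [1.0, 2.0, 4.0])
```
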
Checking Assumptions
Scatterplots of Y on X, e on X
Normal quantile plot of residuals
[Figures: Price against Carat; Residual (-300 to 400) against Carat; histogram (Count) and Normal Quantile Plot of the residuals.]
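
The coordinates behind a normal quantile plot can be computed by hand. The sketch below pairs each sorted residual with a standard normal quantile; the plotting position (i - 0.5)/n is one common convention and is an assumption here (JMP may use a different one), and the residuals are invented.

```python
from statistics import NormalDist

def normal_quantile_points(residuals):
    """Pair each sorted residual with the standard normal quantile at (i - 0.5)/n."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, SD 1
    return [
        (std_normal.inv_cdf((i - 0.5) / n), e)
        for i, e in enumerate(ordered, start=1)
    ]

# Hypothetical residuals, for illustration only.
residuals = [-210.0, -95.0, -40.0, -12.0, 5.0, 30.0, 75.0, 140.0, 320.0]
points = normal_quantile_points(residuals)

for quantile, resid in points:
    print(round(quantile, 2), resid)
```

If the residuals are roughly normal, the points fall near a straight line; curvature or a stray tail suggests skewness or outliers.
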
Inference
Derived from the standard error of the intercept and slope
Three equivalent methods
  Confidence interval
  t-statistic
  p-value
Interpretation of estimates, uncertainty?

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -162.1909    87.42039    -1.86     0.0652
Carat        2312.3973    284.5074     8.13     <.0001*

$\mathrm{Var}(b_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}$
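
The t ratios in the JMP table are just estimate divided by standard error, and a rough 95% interval is the estimate plus or minus about two standard errors. The sketch below reproduces the t ratios from the printed estimates; the multiplier of 2 is a stand-in for the exact t quantile (an assumption), and the function names are hypothetical.

```python
# Estimates and standard errors as printed in the Parameter Estimates table.
estimates = {
    "Intercept": (-162.1909, 87.42039),
    "Carat": (2312.3973, 284.5074),
}

def t_ratio(estimate, std_error):
    """Number of standard errors separating the estimate from zero."""
    return estimate / std_error

def approx_interval(estimate, std_error, multiplier=2.0):
    """Estimate plus or minus `multiplier` standard errors (roughly 95% near 2)."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

for term, (est, se) in estimates.items():
    low, high = approx_interval(est, se)
    print(term, round(t_ratio(est, se), 2), round(low, 1), round(high, 1))
```

The rough interval for the intercept covers zero while the one for the slope does not, which matches the t ratios and p-values: the three methods tell the same story.
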
Prediction
Interpretation
  Where is the population regression line? ($\mu_{Y|X}$)
    What is the average price of all diamonds that weigh 0.3 carats?
  What can be anticipated about a future observation?
    How much for a specific 0.3 carat diamond?
[Figure: Price (0 to 1200) against Carat (0.3 to 0.4), with the two intervals.]
Conf Int for $\mu_{Y|X}$: $E(\mu_{y|x} - \hat{y}_x)^2 = \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right)$
Pred Int for Y: $\sigma^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right)$
Effect of extrapolation

Prediction/Conf. Interval
Estimates are same for both mean and specific diamond
  ŷ = -162.2 + 2312.4 Carats
    = -162.2 + 2312.4 (0.3)
    = $531.52
Intervals differ in anticipated uncertainty
  For the average, JMP (use Fit Model to save interval)
    $509.92 to $553.14
  For a single diamond, prediction interval is
    $243.18 to $819.88
  95% Prediction interval is approximately
    ŷ ± 2 RMSE
[Figure: Price (0 to 1100) against Carat, zoomed to 0.293 to 0.303 carats.]
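
Both intervals at 0.3 carats can be roughly reconstructed from the printed JMP output alone. The sketch below is an approximation under stated assumptions: the sum of squares of x is back-solved from SE(b1) = s / sqrt(sum of squares), the mean weight is back-solved from b0 = ybar - b1 xbar, and the t quantile 1.9734 for 178 degrees of freedom is an assumed approximate value. The names are hypothetical.

```python
from math import sqrt

# Values printed in the JMP output for the 180 small diamonds.
B0, B1 = -162.1909, 2312.3973
RMSE = 145.7105
SE_B1 = 284.5074
MEAN_PRICE = 542.8333
N = 180

T_CRIT = 1.9734  # assumed: approximate 97.5% t quantile with n - 2 = 178 df

# Back-solved quantities (not printed by JMP; assumptions of this sketch).
SXX = (RMSE / SE_B1) ** 2        # from SE(b1) = s / sqrt(sum (x_i - xbar)^2)
XBAR = (MEAN_PRICE - B0) / B1    # from b0 = ybar - b1 * xbar

def intervals(x):
    """Return (y_hat, confidence interval for the mean, prediction interval)."""
    y_hat = B0 + B1 * x
    leverage = 1 / N + (x - XBAR) ** 2 / SXX
    se_mean = RMSE * sqrt(leverage)        # uncertainty about the line
    se_pred = RMSE * sqrt(1 + leverage)    # adds variation of one diamond
    ci = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
    pi = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)
    return y_hat, ci, pi

y_hat, ci, pi = intervals(0.3)
print(round(y_hat, 2), [round(v, 2) for v in ci], [round(v, 2) for v in pi])
```

Calling `intervals(0.5)`, well outside the observed weights, gives a noticeably wider confidence interval because the (x - xbar)^2 term grows: the effect of extrapolation.
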

Diamond Example
[Figures: Bivariate Fit of Price By Carat with Linear Fit, Price (0 to 1200) against Carat (0.3 to 0.4); Residual (-300 to 400) against Carat.]

Linear Fit
Price = -162.1909 + 2312.3973*Carat

Summary of Fit
RSquare                       0.270671
RSquare Adj                   0.266573
Root Mean Square Error        145.7105
Mean of Response              542.8333
Observations (or Sum Wgts)    180

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -162.1909    87.42039    -1.86     0.0652
Carat        2312.3973    284.5074     8.13     <.0001*

Interpretation? Comments? Problems?

More Diamonds
[Figures: Bivariate Fit of Price By Carat with Linear Fit, Price (0 to 600000) against Carat (0 to 11); Residual (-100000 to 400000) against Carat.]

Linear Fit
Price = -22655.47 + 32538.514*Carat

Summary of Fit
RSquare                       0.652987
RSquare Adj                   0.652639
Root Mean Square Error        34109.01
Mean of Response              21957.11
Observations (or Sum Wgts)    1000

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    -22655.47    1491.049    -15.19    <.0001*
Carat        32538.514    750.8501     43.34    <.0001*

Estimate of cost/carat?
Negative intercept?
Possible to fix?

Log-Log Scale
Same expanded sample, but on a log-log scale
Clearly a better fit, but what does it mean?
[Figure: Price against Carat on log-log axes, with Transformed Fit Log to Log.]

Transformed Fit Log to Log
Log(Price) = 8.5485138 + 1.9571815*Log(Carat)

Summary of Fit
RSquare                       0.950228
RSquare Adj                   0.950178
Root Mean Square Error        0.377488
Mean of Response              8.415271
Observations (or Sum Wgts)    1000

Parameter Estimates
Term          Estimate     Std Error   t Ratio   Prob>|t|
Intercept     8.5485138    0.011976    713.79    0.0000*
Log(Carat)    1.9571815    0.014179    138.03    0.0000*
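
One way to read the log-log fit is through ratios: the slope acts like an elasticity, so a 1% increase in weight goes with roughly a 1.96% increase in fitted price. The sketch below back-transforms the printed equation, assuming JMP's Log is the natural log and ignoring any retransformation-bias correction; `predicted_price` is a hypothetical helper.

```python
from math import exp, log

# Coefficients as printed for the log-log fit (natural logs assumed).
B0, B1 = 8.5485138, 1.9571815

def predicted_price(carat):
    """Back-transform the fitted log price; no retransformation-bias correction."""
    return exp(B0 + B1 * log(carat))

one_carat = predicted_price(1.0)                       # exp(B0), since log(1) = 0
doubling_ratio = predicted_price(2.0) / predicted_price(1.0)
pct_for_one_pct = 100 * (predicted_price(1.01) / predicted_price(1.0) - 1)

print(round(one_carat), round(doubling_ratio, 2), round(pct_for_one_pct, 2))
```

Doubling the weight multiplies the fitted price by 2 raised to the slope, whatever the starting weight, so the model describes percentage changes rather than a fixed cost per carat.
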

Promotion Response
[Figure: Bivariate Fit of Sales By Display Feet, Sales (0 to 450) against Display Feet (0 to 8), with Linear Fit and Transformed Fit to Log.]

Linear Fit
Sales = 93.032311 + 39.75648*Display Feet

Summary of Fit
RSquare                       0.711975
RSquare Adj                   0.705574
Root Mean Square Error        51.59123
Mean of Response              268.13
Observations (or Sum Wgts)    47

Parameter Estimates
Term            Estimate     Std Error   t Ratio   Prob>|t|
Intercept       93.032311    18.22782     5.10     <.0001*
Display Feet    39.75648     3.769509    10.55     <.0001*

Transformed Fit to Log
Sales = 83.560256 + 138.62089*Log(Display Feet)

Parameter Estimates
Term                 Estimate     Std Error   t Ratio   Prob>|t|
Intercept            83.560256    14.41344     5.80     <.0001*
Log(Display Feet)    138.62089    9.833914    14.10     <.0001*

Same model, different scales on the axes
[Figure: Bivariate Fit of Sales By Log Feet, Sales (0 to 450) against Log Feet (0 to 2), with Linear Fit.]

Linear Fit
Sales = 83.560256 + 138.62089*Log Feet

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    83.560256    14.41344     5.80     <.0001*
Log Feet     138.62089    9.833914    14.10     <.0001*

What is the interpretation of the coefficient of the log in this model?
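
A numerical check can help with this question. When only x is logged, a given percentage change in x goes with the same absolute change in fitted y, wherever you start. The sketch below uses the printed coefficients and assumes natural logs; the function names are hypothetical.

```python
from math import log

# Coefficients as printed for Sales on Log(Display Feet); natural logs assumed.
B0, B1 = 83.560256, 138.62089

def predicted_sales(display_feet):
    return B0 + B1 * log(display_feet)

def change_for_pct_increase(display_feet, pct):
    """Change in fitted sales when display feet grow by `pct` percent."""
    bigger = display_feet * (1 + pct / 100)
    return predicted_sales(bigger) - predicted_sales(display_feet)

delta_at_2 = change_for_pct_increase(2.0, 1.0)
delta_at_6 = change_for_pct_increase(6.0, 1.0)
print(round(delta_at_2, 3), round(delta_at_6, 3))
```

Both changes equal B1 times log(1.01), about B1/100: each 1% increase in display feet goes with roughly 1.4 more in fitted sales, so extra feet matter less and less in absolute terms.
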

Extrapolation
[Figure: Sales (0 to 700) against Display Feet (0 to 20), with Transformed Fit to Reciprocal and Transformed Fit to Log.]

Transformed Fit to Reciprocal
Sales = 376.69522 - 329.70421*Recip(Display Feet)

Summary of Fit
RSquare                       0.826487
RSquare Adj                   0.822631
Root Mean Square Error        40.04298
Mean of Response              268.13
Observations (or Sum Wgts)    47

Parameter Estimates
Term                   Estimate     Std Error   t Ratio   Prob>|t|
Intercept              376.69522    9.439455     39.91    <.0001*
Recip(Display Feet)    -329.7042    22.51988    -14.64    <.0001*

Transformed Fit to Log
Sales = 83.560256 + 138.62089*Log(Display Feet)

Summary of Fit
RSquare                       0.815349
RSquare Adj                   0.811246
Root Mean Square Error        41.3082
Mean of Response              268.13
Observations (or Sum Wgts)    47

Parameter Estimates
Term                 Estimate     Std Error   t Ratio   Prob>|t|
Intercept            83.560256    14.41344     5.80     <.0001*
Log(Display Feet)    138.62089    9.833914    14.10     <.0001*

Comparable $R^2$, RMSE but very different extrapolations and implications for sales.
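
Evaluating the two printed equations side by side shows how far apart they drift. The sketch below assumes natural logs; the choice of 3, 7 and 20 feet is arbitrary, with 20 feet taken from the right edge of the extrapolation plot.

```python
from math import log

def reciprocal_fit(feet):
    """Sales = 376.69522 - 329.70421 / Display Feet (as printed)."""
    return 376.69522 - 329.70421 / feet

def log_fit(feet):
    """Sales = 83.560256 + 138.62089 * log(Display Feet) (natural log assumed)."""
    return 83.560256 + 138.62089 * log(feet)

for feet in (3, 7, 20):
    print(feet, round(reciprocal_fit(feet), 1), round(log_fit(feet), 1))
```

The reciprocal fit can never exceed its intercept of about 377, so it implies sales level off, while the log fit keeps climbing. The data cannot choose between these stories beyond the range that was observed.
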

Summary
Simple Regression Model (SRM)
Least squares estimators
Role of assumptions
Methods of inference: tests and intervals
Importance of transformations, particularly logs
Prediction, role of extrapolation