justin-benson
Hypothesis tests for slopes in multiple linear regression model
Using the general linear test and sequential sums of squares
An example
Study on heart attacks in rabbits
• An experiment in 32 anesthetized rabbits subjected to an infarction (“heart attack”)
• Three experimental groups:
  – Hearts cooled to 6 °C within 5 minutes of occluded artery (“early cooling”)
  – Hearts cooled to 6 °C within 25 minutes of occluded artery (“late cooling”)
  – Hearts not cooled at all (“no cooling”)
Study on heart attacks in rabbits
• Measurements made at end of experiment:
  – Size of the infarct area (in grams)
  – Size of region at risk for infarction (in grams)
• Primary research question:
  – Does the mean size of the infarcted area differ among the three treatment groups – no cooling, early cooling, late cooling – when controlling for the size of the region at risk for infarction?
A potential regression model
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i $$
where …
• $y_i$ is size of infarcted area (in grams) of rabbit i
• $x_{i1}$ is size of the region at risk (in grams) of rabbit i
• $x_{i2}$ = 1 if early cooling of rabbit i, 0 if not
• $x_{i3}$ = 1 if late cooling of rabbit i, 0 if not
and … the independent error terms $\varepsilon_i$ follow a normal distribution with mean 0 and equal variance $\sigma^2$.
The estimated regression function
[Figure: size of infarcted area (grams) versus size of area at risk (grams), plotted by treatment group: Early, Late, Control]
The regression equation is InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3
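To see how the indicator coding turns one equation into three group-specific lines, the following minimal sketch evaluates the fitted equation for each treatment group. The coefficients are the rounded values printed above; the group labels `"none"`, `"early"`, `"late"` and the function names are illustrative assumptions, not part of the original analysis.

```python
# Rounded coefficients copied from the fitted regression equation.
B0, B_AREA, B_EARLY, B_LATE = -0.135, 0.613, -0.243, -0.0657


def indicators(group):
    """Return (x2, x3): x2 = 1 for early cooling, x3 = 1 for late cooling."""
    if group not in ("none", "early", "late"):
        raise ValueError(f"unknown group: {group!r}")
    return (1 if group == "early" else 0, 1 if group == "late" else 0)


def predicted_infarct_size(area_at_risk, group):
    """Evaluate the fitted equation for a rabbit in the given (hypothetically labeled) group."""
    x2, x3 = indicators(group)
    return B0 + B_AREA * area_at_risk + B_EARLY * x2 + B_LATE * x3


# Same area at risk, three groups: the lines are parallel and differ only in intercept.
for g in ("none", "early", "late"):
    print(g, round(predicted_infarct_size(1.0, g), 4))
```

Because both indicators are 0 for the uncooled group, that group serves as the baseline, and the X2 and X3 coefficients are vertical shifts relative to it.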
Possible hypothesis tests for slopes
#1. Is the regression model containing all three predictors useful in predicting the size of the infarct?
$$ H_0: \beta_1 = \beta_2 = \beta_3 = 0 $$
$$ H_A: \text{at least one } \beta_i \neq 0 $$
#2. Is the size of the infarct significantly (linearly) related to the area of the region at risk?
$$ H_0: \beta_1 = 0 $$
$$ H_A: \beta_1 \neq 0 $$
Possible hypothesis tests for slopes
#3. (Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment after controlling for the size of the region at risk for infarction?
$$ H_0: \beta_2 = \beta_3 = 0 $$
$$ H_A: \text{at least one } \beta_i \neq 0 $$
Linear regression’s general linear test
An aside
Three basic steps
• Define a (larger) full model.
• Define a (smaller) reduced model.
• Use an F statistic to decide whether or not to reject the smaller reduced model in favor of the larger full model.
The full model
For simple linear regression, the full model is:
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$
The full model (or unrestricted model) is the model thought to be most appropriate for the data.
The full model
[Figure: college entrance test score versus high school GPA, annotated with the full model]
$$ \mu_Y = E(Y) = \beta_0 + \beta_1 x $$
$$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$
The full model
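[Figure: grade point average versus height (inches), annotated with the full model]
$$ \mu_Y = E(Y) = \beta_0 + \beta_1 x $$
$$ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$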
The reduced model
The reduced model (or restricted model) is the model described by the null hypothesis H0.
For simple linear regression, the null hypothesis is H0: β1 = 0. Therefore, the reduced model is:
$$ Y_i = \beta_0 + \varepsilon_i $$
The reduced model
[Figure: college entrance test score versus high school GPA, annotated with the reduced model]
$$ \mu_Y = E(Y) = \beta_0 $$
$$ Y_i = \beta_0 + \varepsilon_i $$
The reduced model
[Figure: grade point average versus height (inches), annotated with the reduced model]
$$ \mu_Y = E(Y) = \beta_0 $$
$$ Y_i = \beta_0 + \varepsilon_i $$
The general linear test approach
• “Fit the full model” to the data.
  – Obtain least squares estimates of β0 and β1.
  – Determine error sum of squares – “SSE(F).”
• “Fit the reduced model” to the data.
  – Obtain least squares estimate of β0.
  – Determine error sum of squares – “SSE(R).”
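The two fitting steps can be sketched in a few lines of plain Python. The data below are hypothetical values chosen only to make the arithmetic easy to follow, and the function names are illustrative assumptions.

```python
def fit_reduced(y):
    """Reduced model Y_i = b0 + e_i: the least squares estimate of b0 is the sample mean."""
    b0 = sum(y) / len(y)
    sse_r = sum((yi - b0) ** 2 for yi in y)
    return b0, sse_r


def fit_full(x, y):
    """Full model Y_i = b0 + b1*x_i + e_i, fitted by ordinary least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse_f = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse_f


# Hypothetical data, not taken from any example in these slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0_r, sse_r = fit_reduced(y)
b0_f, b1_f, sse_f = fit_full(x, y)
print("SSE(R) =", round(sse_r, 4), " SSE(F) =", round(sse_f, 4))
```

For these data the slope estimate is 0.6, SSE(R) is 6.0, and SSE(F) is 2.4, so adding the slope removes 3.6 units of error variation.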
The general linear test approach
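[Figure: grade point average versus height (inches), showing the fitted full and reduced models]
$$ \hat{y}_F = 2.95 + 0.001x $$
$$ \hat{y}_R = \bar{y} = 3.015 $$
$$ SSE(F) = \sum (y_i - \hat{y}_i)^2 = 7.5028 $$
$$ SSE(R) = \sum (y_i - \bar{y})^2 = 7.5035 $$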
The general linear test approach
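[Figure: mortality versus latitude (at center of state), showing the fitted full and reduced models]
$$ \hat{y}_R = \bar{y} = 152.88 $$
$$ \hat{y}_F = 389 - 5.98x $$
$$ SSE(R) = \sum (y_i - \bar{y})^2 = 53637 $$
$$ SSE(F) = \sum (y_i - \hat{y}_i)^2 = 17173 $$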
The general linear test approach
• Compare SSE(R) and SSE(F).
• SSE(R) is always larger than (or same as) SSE(F).
– If SSE(F) is close to SSE(R), then variation around fitted full model regression function is almost as large as variation around fitted reduced model regression function.
– If SSE(F) and SSE(R) differ greatly, then the additional parameter(s) in the full model substantially reduce the variation around the fitted regression function.
How close is close?
The test statistic is a function of SSE(R)-SSE(F):
$$ F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} $$
The degrees of freedom (dfR and dfF) are those associated with the reduced and full model error sum of squares, respectively.
Reject H0 if F* is large (or if the P-value is small).
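The statistic translates directly into a small function. This is a minimal sketch: the function name, the argument checks, and the numbers in the usage line are illustrative assumptions rather than values from the slides.

```python
def general_linear_f(sse_reduced, sse_full, df_reduced, df_full):
    """F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    if df_full <= 0 or df_reduced <= df_full:
        raise ValueError("need df_reduced > df_full > 0")
    if sse_full <= 0 or sse_reduced < sse_full:
        raise ValueError("need SSE(R) >= SSE(F) > 0")
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Hypothetical inputs: 5 observations, reduced model with 1 parameter, full model with 2.
f_star = general_linear_f(sse_reduced=6.0, sse_full=2.4, df_reduced=4, df_full=3)
print(round(f_star, 4))
```

The numerator is the drop in error per parameter added, and the denominator is the full model's mean squared error, so F* measures the improvement relative to the remaining noise.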
But for simple linear regression, it’s just the same F test as before
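$$ F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} $$
$$ df_R = n - 1, \quad df_F = n - 2 $$
$$ SSE(R) = SSTO, \quad SSE(F) = SSE $$
$$ F^* = \frac{SSTO - SSE}{(n-1) - (n-2)} \div \frac{SSE}{n-2} = \frac{MSR}{MSE} $$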
The formal F-test for slope parameter β1
Null hypothesis H0: β1 = 0
Alternative hypothesis HA: β1 ≠ 0
Test statistic:
$$ F^* = \frac{MSR}{MSE} $$
P-value = What is the probability that we’d get an F* statistic as large as we did, if the null hypothesis is true?
The P-value is determined by comparing F* to an F distribution with 1 numerator degree of freedom and n-2 denominator degrees of freedom.
Example: Alcoholism and muscle strength?
• Report on strength tests for a sample of 50 alcoholic men
  – x = total lifetime dose of alcohol (kg per kg of body weight)
  – y = strength of deltoid muscle in man’s non-dominant arm
[Figure: strength versus alcohol, titled “Reduced Model Fit”]
$$ \hat{y}_R = \bar{y} = 20.164 $$
$$ SSE(R) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 1224.32 $$
Fit the reduced model
[Figure: strength versus alcohol, titled “Full Model Fit”]
$$ \hat{y}_F = 26.37 - 0.3x $$
$$ SSE(F) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 720.27 $$
Fit the full model
The ANOVA table
Analysis of Variance
Source      DF       SS       MS        F      P
Regression   1   504.04  504.040  33.5899  0.000
Error       48   720.27   15.006
Total       49  1224.32

SSE(R) = SSTO = 1224.32 and SSE(F) = SSE = 720.27
There is a statistically significant linear association between alcoholism and arm strength.
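As a check on the table, the F statistic can be rebuilt from the two error sums of squares and n = 50. This is a minimal sketch using the rounded values shown above, so the result agrees with the printed 33.5899 only to about two decimals.

```python
n = 50
sse_reduced = 1224.32   # SSE(R) = SSTO, from the reduced (intercept-only) fit
sse_full = 720.27       # SSE(F) = SSE, from the full (straight-line) fit

df_reduced = n - 1      # 49
df_full = n - 2         # 48

ssr = sse_reduced - sse_full              # regression sum of squares
msr = ssr / (df_reduced - df_full)        # one extra parameter in the full model
mse = sse_full / df_full
f_star = msr / mse

print("MSR =", round(msr, 3), " MSE =", round(mse, 3), " F* =", round(f_star, 2))
```

The small gap between 504.05 here and 504.04 in the table comes from rounding in the printed sums of squares.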
Sequential (or extra) sums of squares
Another aside
What is a sequential sum of squares?
• It can be viewed in either of two ways:
  – It is the reduction in the error sum of squares (SSE) when one or more predictor variables are added to the model.
  – Or, it is the increase in the regression sum of squares (SSR) when one or more predictor variables are added to the model.
Notation
• The error sum of squares (SSE) and regression sum of squares (SSR) depend on what predictors are in the model.
• So, note what variables are in the model.
  – SSE(X1) denotes the error sum of squares when X1 is the only predictor in the model
  – SSR(X1, X2) denotes the regression sum of squares when X1 and X2 are both in the model
Notation
• The sequential sum of squares of adding:
  – X2 to the model in which X1 is the only predictor is denoted SSR(X2 | X1)
  – X1 to the model in which X2 is the only predictor is denoted SSR(X1 | X2)
  – X1 to the model in which X2 and X3 are predictors is denoted SSR(X1 | X2, X3)
  – X1 and X2 to the model in which X3 is the only predictor is denoted SSR(X1, X2 | X3)
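The notation can be made concrete by computing each quantity as a difference of error sums of squares. The sketch below uses only the standard library and solves the normal equations directly, which is adequate for a tiny illustration but is not how statistical software fits models. The data and the helper names `solve` and `sse` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def sse(y, *predictors):
    """Error sum of squares from regressing y on an intercept plus the given predictors."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


# Hypothetical data, not from any study in these slides.
y = [3.0, 5.0, 4.0, 8.0, 9.0, 10.0]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

ssto = sse(y)                                   # intercept-only model
ssr_x1 = ssto - sse(y, x1)                      # SSR(X1)
ssr_x2_given_x1 = sse(y, x1) - sse(y, x1, x2)   # SSR(X2 | X1)
ssr_x1_x2 = ssto - sse(y, x1, x2)               # SSR(X1, X2)
print(round(ssr_x1, 4), round(ssr_x2_given_x1, 4), round(ssr_x1_x2, 4))
```

By construction SSR(X1) + SSR(X2 | X1) = SSR(X1, X2), which is the sense in which the sums of squares are sequential.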
Allen Cognitive Level (ACL) Study
• David and Riley (1990) investigated relationship of ACL test to level of psychopathology in a set of 69 patients in a hospital psychiatry unit:
  – Response y = ACL score
  – x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale
  – x2 = abstraction (Abstract) score on Shipley Institute of Living Scale
  – x3 = score on Symbol-Digit Modalities Test (SDMT)
Regress y = ACL on x1 = Vocab
The regression equation is
ACL = 4.23 + 0.0298 Vocab
...
Analysis of Variance

Source          DF       SS      MS     F      P
Regression       1   2.6906  2.6906  4.47  0.038
Residual Error  67  40.3590  0.6024
Total           68  43.0496

$$ SSR(X_1) = 2.6906, \quad SSE(X_1) = 40.3590, \quad SSTO(X_1) = 43.0496 $$
Regress y = ACL on x1 = Vocab and x3 = SDMT
The regression equation is
ACL = 3.85 - 0.0068 Vocab + 0.0298 SDMT
...
Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  11.7778  5.8889  12.43  0.000
Residual Error  66  31.2717  0.4738
Total           68  43.0496

Source  DF  Seq SS
Vocab    1  2.6906
SDMT     1  9.0872

$$ SSR(X_1, X_3) = 11.7778, \quad SSE(X_1, X_3) = 31.2717, \quad SSTO(X_1, X_3) = 43.0496 $$
The sequential sum of squares SSR(X3 | X1)
SSR(X3 | X1) is the reduction in the error sum of squares when X3 is added to the model in which X1 is the only predictor:
$$ SSR(X_3 \mid X_1) = SSE(X_1) - SSE(X_1, X_3) $$
$$ SSR(X_3 \mid X_1) = 40.3590 - 31.2717 = 9.0873 $$
The sequential sum of squares SSR(X3 | X1)
SSR(X3 | X1) is the increase in the regression sum of squares when X3 is added to the model in which X1 is the only predictor:
$$ SSR(X_3 \mid X_1) = SSR(X_1, X_3) - SSR(X_1) $$
$$ SSR(X_3 \mid X_1) = 11.7778 - 2.6906 = 9.0872 $$
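The two views give the same quantity up to rounding in the printed output, which a short check makes visible. The values are copied from the ACL regression output; the variable names are illustrative.

```python
# Values from the Minitab output for ACL on Vocab, and ACL on Vocab + SDMT.
sse_x1 = 40.3590
sse_x1_x3 = 31.2717
ssr_x1 = 2.6906
ssr_x1_x3 = 11.7778

via_sse = sse_x1 - sse_x1_x3   # reduction in the error sum of squares
via_ssr = ssr_x1_x3 - ssr_x1   # increase in the regression sum of squares

print(round(via_sse, 4), round(via_ssr, 4))
```

The difference of 0.0001 between the two results is a rounding artifact of the four-decimal output, not a real discrepancy, because SSTO is the same for both models.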
The sequential sum of squares SSR(X3 | X1)
The regression equation is
ACL = 3.85 - 0.0068 Vocab + 0.0298 SDMT
...
Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  11.7778  5.8889  12.43  0.000
Residual Error  66  31.2717  0.4738
Total           68  43.0496

Source  DF  Seq SS
Vocab    1  2.6906
SDMT     1  9.0872

$$ SSR(X_1) = 2.6906, \quad SSR(X_3 \mid X_1) = 9.0872 $$
Regress y = ACL on x3 = SDMT
(The order in which predictors are added determines the “Seq SS” you get.)
The regression equation is
ACL = 3.75 + 0.0281 SDMT
...
Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  11.680  11.680  24.95  0.000
Residual Error  67  31.370   0.468
Total           68  43.050

$$ SSR(X_3) = 11.680, \quad SSE(X_3) = 31.370, \quad SSTO(X_3) = 43.050 $$
Regress y = ACL on x3 = SDMT and x1 = Vocab
(The order in which predictors are added determines the “Seq SS” you get.)
$$ SSR(X_1, X_3) = 11.7778, \quad SSE(X_1, X_3) = 31.2717, \quad SSTO(X_1, X_3) = 43.0496 $$
The regression equation is
ACL = 3.85 + 0.0298 SDMT - 0.0068 Vocab
...
Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  11.7778  5.8889  12.43  0.000
Residual Error  66  31.2717  0.4738
Total           68  43.0496

Source  DF   Seq SS
SDMT     1  11.6799
Vocab    1   0.0979
The sequential sum of squares SSR(X1 | X3)
SSR(X1 | X3) is the reduction in the error sum of squares when X1 is added to the model in which X3 is the only predictor:
$$ SSR(X_1 \mid X_3) = SSE(X_3) - SSE(X_1, X_3) $$
$$ SSR(X_1 \mid X_3) = 31.370 - 31.2717 = 0.0983 $$
The sequential sum of squares SSR(X1 | X3)
SSR(X1 | X3) is the increase in the regression sum of squares when X1 is added to the model in which X3 is the only predictor:
$$ SSR(X_1 \mid X_3) = SSR(X_1, X_3) - SSR(X_3) $$
$$ SSR(X_1 \mid X_3) = 11.7778 - 11.680 = 0.0978 $$
Regress y = ACL on x3 = SDMT and x1 = Vocab
(The order in which predictors are added determines the “Seq SS” you get.)
The regression equation is
ACL = 3.85 + 0.0298 SDMT - 0.0068 Vocab
...
Analysis of Variance

Source          DF       SS      MS      F      P
Regression       2  11.7778  5.8889  12.43  0.000
Residual Error  66  31.2717  0.4738
Total           68  43.0496

Source  DF   Seq SS
SDMT     1  11.6799
Vocab    1   0.0979

$$ SSR(X_3) = 11.6799, \quad SSR(X_1 \mid X_3) = 0.0979 $$
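The order dependence is easy to see side by side: the individual Seq SS entries change with the entry order, but their total is always SSR(X1, X3). A minimal sketch using the two "Seq SS" tables from the ACL output (the dictionary layout is an illustrative choice):

```python
# Seq SS values from the two fits of ACL on Vocab and SDMT, keyed by entry order.
seq_ss_by_order = {
    ("Vocab", "SDMT"): [2.6906, 9.0872],
    ("SDMT", "Vocab"): [11.6799, 0.0979],
}
ssr_both = 11.7778  # SSR(X1, X3) from either ANOVA table

for order, seq_ss in seq_ss_by_order.items():
    total = sum(seq_ss)
    print(" then ".join(order), "->", seq_ss, "total =", round(total, 4))

totals = [sum(v) for v in seq_ss_by_order.values()]
```

Vocab accounts for 2.6906 when entered first but only 0.0979 when entered after SDMT, which reflects how much of its explanatory value overlaps with SDMT.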
More sequential sums of squares (Regress y on x3, x1, x2)
The regression equation is
ACL = 3.95 + 0.0274 SDMT - 0.0174 Vocab + 0.0122 Abstract
...
Analysis of Variance

Source          DF       SS      MS     F      P
Regression       3  12.3009  4.1003  8.67  0.000
Residual Error  65  30.7487  0.4731
Total           68  43.0496

Source    DF   Seq SS
SDMT       1  11.6799
Vocab      1   0.0979
Abstract   1   0.5230

$$ SSR(X_3) = 11.6799 $$
$$ SSR(X_1 \mid X_3) = 0.0979 $$
$$ SSR(X_2 \mid X_1, X_3) = 0.5230 $$
Two- (or three- or more-) degree of freedom sequential sums of squares
The regression equation is
ACL = 3.95 + 0.0274 SDMT - 0.0174 Vocab + 0.0122 Abstract
...
Analysis of Variance

Source          DF       SS      MS     F      P
Regression       3  12.3009  4.1003  8.67  0.000
Residual Error  65  30.7487  0.4731
Total           68  43.0496

Source    DF   Seq SS
SDMT       1  11.6799
Vocab      1   0.0979
Abstract   1   0.5230

$$ SSR(X_1 \mid X_3) = 0.0979, \quad SSR(X_2 \mid X_1, X_3) = 0.5230, \quad SSR(X_1, X_2 \mid X_3) = 0.6209 $$
$$ SSR(X_1, X_2 \mid X_3) = SSE(X_3) - SSE(X_1, X_2, X_3) = 31.370 - 30.7487 = 0.6213 $$
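A multi-degree-of-freedom sequential sum of squares is the sum of the trailing one-degree-of-freedom entries in the "Seq SS" table, provided the predictors of interest were entered last. The helper below is a hypothetical convenience function written for this illustration; the numbers come from the output above.

```python
# Seq SS table for the fit entered in the order SDMT, Vocab, Abstract.
seq_ss = [("SDMT", 11.6799), ("Vocab", 0.0979), ("Abstract", 0.5230)]


def ssr_of_last(seq, k):
    """Sum the last k Seq SS entries: SSR(last k predictors | all earlier predictors)."""
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and the number of predictors")
    return sum(value for _, value in seq[-k:])


ssr_x1_x2_given_x3 = ssr_of_last(seq_ss, 2)   # Vocab and Abstract, given SDMT
ssr_all = ssr_of_last(seq_ss, 3)              # all three predictors
via_sse = 31.370 - 30.7487                    # SSE(X3) - SSE(X1, X2, X3)

print(round(ssr_x1_x2_given_x3, 4), round(via_sse, 4), round(ssr_all, 4))
```

The two routes give 0.6209 and 0.6213; the gap arises because SSE(X3) is printed to only three decimals in the SDMT-only output.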
The hypothesis tests for the slopes
Possible hypothesis tests for slopes
#1. Is the regression model containing all three predictors useful in predicting the size of the infarct?
$$ H_0: \beta_1 = \beta_2 = \beta_3 = 0 $$
$$ H_A: \text{at least one } \beta_i \neq 0 $$
#2. Is the size of the infarct significantly (linearly) related to the area of the region at risk?
$$ H_0: \beta_1 = 0 $$
$$ H_A: \beta_1 \neq 0 $$
Possible hypothesis tests for slopes
#3. (Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?
$$ H_0: \beta_2 = \beta_3 = 0 $$
$$ H_A: \text{at least one } \beta_i \neq 0 $$
Testing all slope parameters are 0
Full model
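$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i $$
$$ SSE(F) = SSE, \quad df_F = n - 4 $$
Reduced model
$$ Y_i = \beta_0 + \varepsilon_i $$
$$ SSE(R) = SSTO, \quad df_R = n - 1 $$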
Testing all slope parameters are 0
The general linear test statistic:
$$ F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} $$
becomes the usual overall F-test:
$$ F^* = \frac{SSR}{3} \div \frac{SSE}{n - 4} = \frac{MSR}{MSE} $$
Testing all slope parameters are 0
Use overall F-test and P-value reported in ANOVA table.
The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3
...
Analysis of Variance

Source          DF       SS       MS      F      P
Regression       3  0.95927  0.31976  16.43  0.000
Residual Error  28  0.54491  0.01946
Total           31  1.50418

$$ H_0: \beta_1 = \beta_2 = \beta_3 = 0 $$
$$ H_A: \text{at least one } \beta_i \neq 0 $$
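The overall F statistic in the table can be reproduced from the sums of squares with n = 32 rabbits and p = 4 regression parameters. This is a minimal sketch with illustrative variable names; it also computes SSR/SSTO, the proportion of variation explained.

```python
# Sums of squares from the ANOVA table for the rabbit infarct data.
ssr, sse, ssto = 0.95927, 0.54491, 1.50418
n, p = 32, 4            # 32 rabbits; intercept plus three slopes

msr = ssr / (p - 1)     # 3 numerator degrees of freedom
mse = sse / (n - p)     # 28 denominator degrees of freedom
f_star = msr / mse
r_squared = ssr / ssto  # proportion of variation in infarct size explained

print("MSR =", round(msr, 5), " MSE =", round(mse, 5),
      " F* =", round(f_star, 2), " R-Sq =", round(100 * r_squared, 1), "%")
```

A large F* here says only that at least one of the three slopes is nonzero; it does not say which one, which is why the partial F-tests are needed.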
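Testing one slope is 0, say β1 = 0
Full model
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i $$
$$ SSE(F) = SSE(X_1, X_2, X_3), \quad df_F = n - 4 $$
Reduced model
$$ y_i = \beta_0 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i $$
$$ SSE(R) = SSE(X_2, X_3), \quad df_R = n - 3 $$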
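Testing one slope is 0, say β1 = 0
The general linear test statistic:
$$ F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} $$
becomes a partial F-test:
$$ F^* = \frac{SSR(X_1 \mid X_2, X_3)}{1} \div \frac{SSE(X_1, X_2, X_3)}{n - 4} = \frac{MSR(X_1 \mid X_2, X_3)}{MSE(X_1, X_2, X_3)} $$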
Equivalence of t-test to partial F-test for one slope
Since there is only one numerator degree of freedom in the partial F-test for one slope, it is equivalent to the t-test.
$$ t^2_{(n-p)} = F_{(1,\, n-p)} $$
The t-test is a test for the marginal significance of the x1 predictor after x2 and x3 have been taken into account.
The regression equation is
InfSize = - 0.135 - 0.2430 X2 - 0.0657 X3 + 0.613 AreaSize

Predictor      Coef  SE Coef      T      P
Constant    -0.1345   0.1040  -1.29  0.206
X2         -0.24348  0.06229  -3.91  0.001
X3         -0.06566  0.06507  -1.01  0.322
AreaSize     0.6127   0.1070   5.72  0.000

S = 0.1395   R-Sq = 63.8%   R-Sq(adj) = 59.9%

Analysis of Variance

Source          DF       SS       MS      F      P
Regression       3  0.95927  0.31976  16.43  0.000
Residual Error  28  0.54491  0.01946
Total           31  1.50418

Source    DF   Seq SS
X2         1  0.29994
X3         1  0.02191
AreaSize   1  0.63742
Equivalence of the t-test to the partial F-test
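$$ F^* = \frac{SSR(X_1 \mid X_2, X_3)}{1} \div \frac{SSE(X_1, X_2, X_3)}{n - 4} = \frac{0.63742}{0.01946} = 32.7554 $$
The t-test:
$$ t^* = 5.72 \quad \text{and} \quad P = 0.000\ldots < 0.001 $$
The partial F-test:
F distribution with 1 DF in numerator and 28 DF in denominator

      x  P( X <= x )
32.7554       1.0000

$$ t^{*2} = 5.72^2 = 32.7184 \approx F^* $$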
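The near-equality of t*² and F* can be checked numerically. This is a minimal sketch using the rounded values printed in the output, so the two numbers agree only to about three significant figures; with unrounded inputs they would be identical.

```python
ssr_x1_given_x2_x3 = 0.63742   # Seq SS for AreaSize when it is entered last
mse_full = 0.01946             # MSE(X1, X2, X3) from the ANOVA table
t_star = 5.72                  # printed t statistic for AreaSize

f_star = (ssr_x1_given_x2_x3 / 1) / mse_full   # 1 numerator degree of freedom
t_squared = t_star ** 2
relative_gap = abs(f_star - t_squared) / f_star

print("F* =", round(f_star, 4), " t*^2 =", round(t_squared, 4),
      " relative gap =", round(relative_gap, 4))
```

The relative gap of roughly 0.1% comes entirely from t* being reported to two decimals.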
The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3

Predictor      Coef  SE Coef      T      P
Constant    -0.1345   0.1040  -1.29  0.206
AreaSize     0.6127   0.1070   5.72  0.000
X2         -0.24348  0.06229  -3.91  0.001
X3         -0.06566  0.06507  -1.01  0.322

S = 0.1395   R-Sq = 63.8%   R-Sq(adj) = 59.9%

Analysis of Variance

Source          DF       SS       MS      F      P
Regression       3  0.95927  0.31976  16.43  0.000
Residual Error  28  0.54491  0.01946
Total           31  1.50418

Source    DF   Seq SS
AreaSize   1  0.62492
X2         1  0.31453
X3         1  0.01981
Testing whether two slopes are 0, say β2 = β3 = 0
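Full model
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i $$
$$ SSE(F) = SSE(X_1, X_2, X_3), \quad df_F = n - 4 $$
Reduced model
$$ y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i $$
$$ SSE(R) = SSE(X_1), \quad df_R = n - 2 $$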
Testing whether two slopes are 0, say β2 = β3 = 0
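The general linear test statistic:
$$ F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} $$
becomes a partial F-test:
$$ F^* = \frac{SSR(X_2, X_3 \mid X_1)}{2} \div \frac{SSE(X_1, X_2, X_3)}{n - 4} = \frac{MSR(X_2, X_3 \mid X_1)}{MSE(X_1, X_2, X_3)} $$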
The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3

Predictor      Coef  SE Coef      T      P
Constant    -0.1345   0.1040  -1.29  0.206
AreaSize     0.6127   0.1070   5.72  0.000
X2         -0.24348  0.06229  -3.91  0.001
X3         -0.06566  0.06507  -1.01  0.322

S = 0.1395   R-Sq = 63.8%   R-Sq(adj) = 59.9%

Analysis of Variance

Source          DF       SS       MS      F      P
Regression       3  0.95927  0.31976  16.43  0.000
Residual Error  28  0.54491  0.01946
Total           31  1.50418

Source    DF   Seq SS
AreaSize   1  0.62492
X2         1  0.31453
X3         1  0.01981
Testing whether β2 = β3 = 0
$$ F^* = \frac{SSR(X_2, X_3 \mid X_1)}{2} \div \frac{SSE(X_1, X_2, X_3)}{n - 4} $$
$$ F^* = \frac{(0.31453 + 0.01981)/2}{0.01946} = 8.59 $$
F distribution with 2 DF in numerator and 28 DF in denominator

     x  P( X <= x )
8.5900       0.9988

$$ P = 1 - 0.9988 = 0.0012 < 0.05 $$
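The P-value for this particular test can be confirmed without statistical tables. When the numerator has exactly 2 degrees of freedom, the upper tail of the F distribution has the closed form P(F > f) = (1 + 2f/ν)^(−ν/2), where ν is the denominator degrees of freedom. The sketch below relies on that special case only and does not generalize to other numerator degrees of freedom; the function name is an illustrative assumption.

```python
def f_upper_tail_2df(f_stat, df_denominator):
    """P(F > f_stat) for an F distribution with exactly 2 numerator degrees of freedom."""
    if f_stat < 0 or df_denominator <= 0:
        raise ValueError("need f_stat >= 0 and df_denominator > 0")
    return (1.0 + 2.0 * f_stat / df_denominator) ** (-df_denominator / 2.0)


# Seq SS for X2 and X3 entered after AreaSize, and the full-model MSE.
ssr_x2_x3_given_x1 = 0.31453 + 0.01981
mse_full = 0.01946

f_star = (ssr_x2_x3_given_x1 / 2) / mse_full
p_value = f_upper_tail_2df(f_star, 28)

print("F* =", round(f_star, 2), " P =", round(p_value, 4))
```

The result agrees with the tabled value 1 − 0.9988 = 0.0012, so the null hypothesis β2 = β3 = 0 is rejected at the 0.05 level: infarct size is related to treatment after controlling for the size of the region at risk.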