Regression Model Building
Predicting Number of Crew Members of Cruise Ships
Data Description
• n = 158 Cruise Ships
• Dependent Variable – Crew Size (100s)
• Potential Predictor Variables
  • Age (2013 – Year Built)
  • Tonnage (1000s of Tons)
  • Passengers (100s)
  • Length (100s of feet)
  • Cabins (100s)
  • Passenger Density (Passengers/Space)
Data – First 20 Cases

Ship         Cruise Line  Age  Tonnage  Pssngrs  Length  Cabins  PassDens   Crew
Journey      Azamara        6   30.277     6.94    5.94    3.55     42.64   3.55
Quest        Azamara        6   30.277     6.94    5.94    3.55     42.64   3.55
Celebration  Carnival      26   47.262    14.86    7.22    7.43     31.80   6.70
Conquest     Carnival      11  110.000    29.74    9.53   14.88     36.99  19.10
Destiny      Carnival      17  101.353    26.42    8.92   13.21     38.36  10.00
Ecstasy      Carnival      22   70.367    20.52    8.55   10.20     34.29   9.20
Elation      Carnival      15   70.367    20.52    8.55   10.20     34.29   9.20
Fantasy      Carnival      23   70.367    20.56    8.55   10.22     34.23   9.20
Fascination  Carnival      19   70.367    20.52    8.55   10.20     34.29   9.20
Freedom      Carnival       6  110.239    37.00    9.51   14.87     29.79  11.50
Glory        Carnival      10  110.000    29.74    9.51   14.87     36.99  11.60
Holiday      Carnival      28   46.052    14.52    7.27    7.26     31.72   6.60
Imagination  Carnival      18   70.367    20.52    8.55   10.20     34.29   9.20
Inspiration  Carnival      17   70.367    20.52    8.55   10.20     34.29   9.20
Legend       Carnival      11   86.000    21.24    9.63   10.62     40.49   9.30
Liberty*     Carnival       8  110.000    29.74    9.51   14.87     36.99  11.60
Miracle      Carnival       9   88.500    21.24    9.63   10.62     41.67  10.30
Paradise     Carnival      15   70.367    20.52    8.55   10.20     34.29   9.20
Pride        Carnival      12   88.500    21.24    9.63   11.62     41.67   9.30
Sensation    Carnival      20   70.367    20.52    8.55   10.20     34.29   9.20
Full Model (6 Predictors, 7 Parameters, n=158)

Regression Statistics
Multiple R          0.9615
R Square            0.9245
Adjusted R Square   0.9215
Standard Error      0.9819
Observations        158

ANOVA
             df      SS      MS      F   Significance F
Regression    6  1781.5   296.9  308.0   0.0000
Residual    151   145.6     1.0
Total       157  1927.1

            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept       -0.52134         1.05703  -0.493   0.6226     -2.610      1.567
Age             -0.01254         0.01420  -0.884   0.3783     -0.041      0.016
Tonnage          0.01324         0.01189   1.113   0.2673     -0.010      0.037
Pssngrs         -0.14976         0.04759  -3.147   0.0020     -0.244     -0.056
Length           0.40348         0.11445   3.525   0.0006      0.177      0.630
Cabins           0.80163         0.08922   8.985   0.0000      0.625      0.978
PassDens        -0.00066         0.01581  -0.042   0.9669     -0.032      0.031
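The summary statistics of the full model follow directly from the ANOVA sums of squares. The sketch below recomputes them from the rounded values printed on the slide, so small differences in the last digit are expected; the variable names are illustrative, not taken from the original analysis.

```python
import math

# Values copied from the full-model ANOVA table (rounded as printed).
n = 158            # observations
p_prime = 7        # parameters, including the intercept
ss_reg, ss_res, ss_tot = 1781.5, 145.6, 1927.1

df_reg = p_prime - 1      # regression degrees of freedom
df_res = n - p_prime      # residual degrees of freedom

ms_reg = ss_reg / df_reg  # mean square for regression
ms_res = ss_res / df_res  # mean square for residual

r_sq = ss_reg / ss_tot
adj_r_sq = 1 - (ss_res / df_res) / (ss_tot / (n - 1))
std_err = math.sqrt(ms_res)   # residual standard error
f_stat = ms_reg / ms_res

print(f"R Square           {r_sq:.4f}")
print(f"Adjusted R Square  {adj_r_sq:.4f}")
print(f"Standard Error     {std_err:.4f}")
print(f"F                  {f_stat:.1f}")
```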
Backward Elimination – Model Based AIC (minimize)
$\mathrm{AIC}(\text{Model}) = n \ln\!\left(\frac{SSE(\text{Model})}{n}\right) + 2\,\text{parms}(\text{Model}) + \text{constant}$

Full Model (7 Parms, constant=0): $\mathrm{AIC} = 158 \ln\!\left(\frac{145.57}{158}\right) + 2(7) = 1.055$
Round 2: $\mathrm{AIC} = 158 \ln\!\left(\frac{145.57}{158}\right) + 2(6) = -0.943$
Round 3: $\mathrm{AIC} = 158 \ln\!\left(\frac{146.39}{158}\right) + 2(5) = -2.062$

FullMod        Df      SS     RSS     AIC
- passdens      1   0.002  145.57  -0.943
- age           1   0.753  146.32  -0.130
- tonnage       1   1.195  146.77   0.347
<none>                     145.57   1.055
- passengers    1   9.548  155.12   9.092
- length        1  11.980  157.55  11.551
- cabins        1  77.821  223.39  66.721

Round2         Df      SS     RSS     AIC
- age           1   0.815  146.39  -2.062
<none>                     145.57  -0.943
- tonnage       1   2.007  147.58  -0.780
- length        1  12.069  157.64   9.641
- passengers    1  14.027  159.60  11.591
- cabins        1  79.556  225.13  65.944

Round3         Df      SS     RSS     AIC
<none>                     146.39  -2.062
- tonnage       1   3.866  150.25   0.056
- length        1  11.739  158.13   8.126
- passengers    1  14.275  160.66  10.640
- cabins        1  78.861  225.25  64.028
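As a quick check of the AIC arithmetic used in backward elimination, the following sketch evaluates AIC = n ln(SSE/n) + 2(parms) with the constant set to 0, using the rounded residual sums of squares from the tables. The function name `aic` is a hypothetical helper, and results differ from the slide values in the third decimal because the inputs are rounded.

```python
import math

def aic(sse, n, n_parms, constant=0.0):
    """AIC = n * ln(SSE / n) + 2 * (number of parameters) + constant."""
    return n * math.log(sse / n) + 2 * n_parms + constant

n = 158
aic_full = aic(145.57, n, 7)    # full model: 6 predictors + intercept
aic_round2 = aic(145.57, n, 6)  # after dropping passenger density
aic_round3 = aic(146.39, n, 5)  # after also dropping age

for label, value in [("Full", aic_full), ("Round 2", aic_round2), ("Round 3", aic_round3)]:
    print(f"{label:8s} AIC = {value:.3f}")
```

Each drop lowers AIC, which is why elimination continues until Round 3, where removing any remaining predictor would raise it.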
Forward Selection (AIC Based)
$SS_{\text{TOTAL}} = 1927.08$

Null Model: $\mathrm{AIC} = 158 \ln\!\left(\frac{1927.08}{158}\right) + 2(1) = 397.18$

Null Model     Df       SS      RSS     AIC
+ cabins        1  1742.21   184.88   28.82
+ tonnage       1  1658.03   269.05   88.10
+ passengers    1  1614.23   312.86  111.94
+ length        1  1546.60   380.49  142.86
+ age           1   542.66  1384.42  346.93
+ passdens      1    46.60  1880.48  395.32
<none>                      1927.08  397.18

Round2         Df       SS      RSS      AIC
+ length        1  22.9636   161.91   9.8661
+ passdens      1  14.9541   169.92  17.4948
+ tonnage       1  12.5135   172.36  19.7480
+ passengers    1   7.0656   177.81  24.6647
+ age           1   5.4442   179.43  26.0989
<none>                       184.88  28.8215

Round3         Df       SS      RSS      AIC
+ passengers    1  11.6609   150.25   0.0565
+ passdens      1   6.3732   155.54   5.5212
<none>                       161.91   9.8661
+ age           1   1.9702   159.94   9.9317
+ tonnage       1   1.2514   160.66  10.6402

Round4         Df       SS      RSS       AIC
+ tonnage       1   3.8656   146.39  -2.06164
+ age           1   2.6733   147.58  -0.77996
+ passdens      1   2.5635   147.69  -0.66241
<none>                       150.25   0.05650

Round5         Df       SS      RSS       AIC
<none>                       146.39  -2.06164
+ age           1  0.81467   145.57  -0.94339
+ passdens      1  0.06366   146.32  -0.13037
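The forward-selection path can be replayed from the residual sums of squares in the tables. The sketch below is a minimal greedy loop, assuming AIC = n ln(RSS/n) + 2(parms); the `RSS` lookup holds only the values printed on the slides, and `forward_select` is a hypothetical helper rather than the routine that produced the original output.

```python
import math

N = 158
PREDICTORS = ["age", "tonnage", "passengers", "length", "cabins", "passdens"]

def key(*names):
    return frozenset(names)

# RSS values transcribed from the forward-selection tables.
RSS = {
    key(): 1927.08,
    key("cabins"): 184.88, key("tonnage"): 269.05, key("passengers"): 312.86,
    key("length"): 380.49, key("age"): 1384.42, key("passdens"): 1880.48,
    key("cabins", "length"): 161.91, key("cabins", "passdens"): 169.92,
    key("cabins", "tonnage"): 172.36, key("cabins", "passengers"): 177.81,
    key("cabins", "age"): 179.43,
    key("cabins", "length", "passengers"): 150.25,
    key("cabins", "length", "passdens"): 155.54,
    key("cabins", "length", "age"): 159.94,
    key("cabins", "length", "tonnage"): 160.66,
    key("cabins", "length", "passengers", "tonnage"): 146.39,
    key("cabins", "length", "passengers", "age"): 147.58,
    key("cabins", "length", "passengers", "passdens"): 147.69,
    key("cabins", "length", "passengers", "tonnage", "age"): 145.57,
    key("cabins", "length", "passengers", "tonnage", "passdens"): 146.32,
}

def aic(model):
    # Parameters = predictors in the model + intercept.
    return N * math.log(RSS[model] / N) + 2 * (len(model) + 1)

def forward_select():
    current, current_aic, path = key(), aic(key()), []
    while True:
        candidates = [current | {v} for v in PREDICTORS if v not in current]
        if not candidates:
            break
        best = min(candidates, key=aic)
        if aic(best) >= current_aic:   # no addition lowers AIC: stop
            break
        (added,) = best - current
        path.append((added, round(aic(best), 2)))
        current, current_aic = best, aic(best)
    return path

for variable, value in forward_select():
    print(f"add {variable:10s} AIC = {value}")
```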
Stepwise Regression (AIC Based)

Null Model     Df       SS      RSS     AIC
+ cabins        1  1742.21   184.88   28.82
+ tonnage       1  1658.03   269.05   88.10
+ passengers    1  1614.23   312.86  111.94
+ length        1  1546.60   380.49  142.86
+ age           1   542.66  1384.42  346.93
+ passdens      1    46.60  1880.48  395.32
<none>                      1927.08  397.18

Round2         Df       SS      RSS     AIC
+ length        1    22.96   161.91    9.87
+ passdens      1    14.95   169.92   17.49
+ tonnage       1    12.51   172.36   19.75
+ passengers    1     7.07   177.81   24.66
+ age           1     5.44   179.43   26.10
<none>                       184.88   28.82
- cabins        1  1742.21  1927.08  397.18

Round3         Df       SS      RSS      AIC
+ passengers    1   11.661   150.25    0.056
+ passdens      1    6.373   155.54    5.521
<none>                       161.91    9.866
+ age           1    1.970   159.94    9.932
+ tonnage       1    1.251   160.66   10.640
- length        1   22.964   184.88   28.821
- cabins        1  218.571   380.49  142.859

Round4         Df       SS      RSS      AIC
+ tonnage       1    3.866   146.39   -2.062
+ age           1    2.673   147.58   -0.780
+ passdens      1    2.563   147.69   -0.662
<none>                       150.25    0.056
- passengers    1   11.661   161.91    9.866
- length        1   27.559   177.81   24.665
- cabins        1   95.781   246.03   75.974

Round5         Df       SS      RSS      AIC
<none>                       146.39   -2.062
+ age           1    0.815   145.57   -0.943
+ passdens      1    0.064   146.32   -0.130
- tonnage       1    3.866   150.25    0.056
- length        1   11.739   158.13    8.126
- passengers    1   14.275   160.66   10.640
- cabins        1   78.861   225.25   64.028
Summary of Automated Models
• Backward Elimination
  • Drop Passenger Density (AIC drops from 1.055 to -0.943)
  • Drop Age (AIC drops from -0.943 to -2.062)
  • Stop: Keep Tonnage, Passengers, Length, Cabins
• Forward Selection
  • Add Cabins (AIC drops from 397.18 to 28.82)
  • Add Length (AIC drops from 28.82 to 9.8661)
  • Add Passengers (AIC drops from 9.8661 to 0.0565)
  • Add Tonnage (AIC drops from 0.0565 to -2.06)
  • Stop: Keep Tonnage, Passengers, Length, Cabins
• Stepwise – Same as Forward Selection
All Possible (Subset) Regressions
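$p'$ = Number of parameters (including intercept) in Model

$R^2(\text{Model}) = \frac{SS_{\text{Regression}}(\text{Model})}{SS_{\text{Total}}} = 1 - \frac{SS_{\text{Residual}}(\text{Model})}{SS_{\text{Total}}}$   Goal: Maximize within reason

$\text{Adj-}R^2(\text{Model}) = 1 - \left(\frac{n-1}{n-p'}\right)\frac{SS_{\text{Residual}}(\text{Model})}{SS_{\text{Total}}}$   Goal: Maximize

$C_p(\text{Model}) = \frac{SS_{\text{Residual}}(\text{Model})}{s^2} + 2p' - n$   Goal: $C_p \le p'$, where $s^2 = MS_{\text{Residual}}(\text{Full Model})$

$\mathrm{BIC}(\text{Model}) = n \ln\!\left(\frac{SS_{\text{Residual}}(\text{Model})}{n}\right) + \ln(n)\,p' + \text{constant}$   Goal: Minimize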
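The four subset criteria can be computed from sums of squares alone: R-Sq = 1 - SS_Res/SS_Total, Adj-R2 = 1 - ((n-1)/(n-p'))(SS_Res/SS_Total), Cp = SS_Res/s^2 + 2p' - n, and BIC = n ln(SS_Res/n) + ln(n) p' + constant. The sketch below applies them to the four-predictor model (Tonnage, Passengers, Length, Cabins). One assumption is labeled in the code: the BIC constant is taken as -n ln(SS_Total/n), which appears to reproduce the scale of the tabulated BIC values but is not stated in the source. `subset_criteria` is a hypothetical helper name.

```python
import math

def subset_criteria(ss_res, ss_tot, n, p_prime, s2_full):
    """Return (R-Sq, Adj-R2, Cp, BIC) for a model with p_prime parameters."""
    r_sq = 1 - ss_res / ss_tot
    adj_r_sq = 1 - ((n - 1) / (n - p_prime)) * (ss_res / ss_tot)
    cp = ss_res / s2_full + 2 * p_prime - n
    # ASSUMPTION: constant = -n * ln(SS_Total / n), i.e. BIC relative to the
    # intercept-only fit; inferred from the table values, not from the slides.
    constant = -n * math.log(ss_tot / n)
    bic = n * math.log(ss_res / n) + math.log(n) * p_prime + constant
    return r_sq, adj_r_sq, cp, bic

n = 158
ss_tot = 1927.08
s2_full = 145.57 / (n - 7)   # MS_Residual of the full 7-parameter model

# Four-predictor model: RSS = 146.39, p' = 5 (4 predictors + intercept).
r_sq, adj_r_sq, cp, bic = subset_criteria(146.39, ss_tot, n, 5, s2_full)
print(f"R-Sq={r_sq:.3f}  Adj-R2={adj_r_sq:.3f}  Cp={cp:.3f}  BIC={bic:.3f}")
```

Because the inputs are rounded to two decimals, the results agree with the tabulated row only to about two or three decimals.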
All Possible (Subset) Regressions (Best 4 per Grp)
#preds  Int  Age  Ton  Pass  Lngth  Cabin  PassDen   R-Sq  Adj-R2       Cp       BIC
1         1    0    0     0      0      1        0  0.904   0.903   37.772  -360.238
1         1    0    1     0      0      0        0  0.860   0.859  125.086  -300.954
1         1    0    0     1      0      0        0  0.838   0.837  170.523  -277.122
1         1    0    0     0      1      0        0  0.803   0.801  240.675  -246.201
2         1    0    0     0      1      1        0  0.916   0.915   15.952  -376.131
2         1    0    0     0      0      1        1  0.912   0.911   24.261  -368.502
2         1    0    1     0      0      1        0  0.911   0.909   26.792  -366.249
2         1    0    0     1      0      1        0  0.908   0.907   32.443  -361.332
3         1    0    0     1      1      1        0  0.922   0.921    5.857  -382.878
3         1    0    0     0      1      1        1  0.919   0.918   11.341  -377.413
3         1    0    1     1      0      1        0  0.918   0.916   14.023  -374.808
3         1    1    0     0      1      1        0  0.917   0.915   15.909  -373.002
4         1    0    1     1      1      1        0  0.924   0.922    3.847  -381.933
4         1    1    0     1      1      1        0  0.923   0.921    5.084  -380.652
4         1    0    0     1      1      1        1  0.923   0.921    5.197  -380.534
4         1    0    1     0      1      1        1  0.919   0.917   13.056  -372.631
5         1    1    1     1      1      1        0  0.924   0.922    5.002  -377.752
5         1    0    1     1      1      1        1  0.924   0.922    5.781  -376.939
5         1    1    0     1      1      1        1  0.924   0.921    6.240  -376.462
5         1    1    1     0      1      1        1  0.920   0.917   14.904  -367.717
6         1    1    1     1      1      1        1  0.924   0.921    7.000  -372.692

Best by criterion (from the table): BIC = -382.878 (Pass, Lngth, Cabin); Cp = 3.847 and Adj-R2 = 0.922 (Ton, Pass, Lngth, Cabin; Adj-R2 ties with two 5-predictor models to three decimals).
Cross-Validation
• Hold-out Sample (Training Sample = 100, Validation = 58)
  • Fit model on training sample and obtain regression estimates
  • Apply regression estimates from training sample to validation sample X levels to obtain predicted values
  • MSEP = sum(obs - pred)^2 / n
  • Fit model on validation sample and compare regression coefficients with model for training sample
• PRESS Statistic (delete observations 1-at-a-time)
  • Fit model with each observation deleted 1-at-a-time
  • Obtain residual for each observation when it was deleted
  • PRESS = sum(obs - pred(deleted))^2
• K-fold Cross-validation
  • Extension of PRESS to where K groups of cases are deleted
  • Useful for computationally intensive models (not OLS)
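For the K-fold variant, the only new ingredient beyond PRESS is how cases are partitioned into K groups. A minimal sketch of that partitioning follows; the helper name `k_fold_indices`, the choice K = 5, and the seed are all illustrative assumptions, not values from the slides.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split case indices 0..n-1 into k (train, test) pairs with disjoint test folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)         # random assignment of cases to folds
    folds = [indices[i::k] for i in range(k)]    # fold sizes differ by at most one
    splits = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        splits.append((sorted(train), sorted(fold)))
    return splits

# Example with the cruise-ship sample size and a hypothetical K = 5.
for fold_number, (train, test) in enumerate(k_fold_indices(158, 5), start=1):
    print(f"fold {fold_number}: fit on {len(train)} cases, predict {len(test)} held-out cases")
```

The model would be refit on each `train` set and the squared prediction errors on the matching `test` set accumulated, exactly as PRESS does with groups of size one.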
Hold-Out Sample – n_in = 100, n_out = 58

Training Sample  Estimate  Std Err  t-stat  P-Value
(Intercept)       -1.1018   0.7735  -1.424   0.1576
tonnage            0.0048   0.0118   0.407   0.6851
passengers        -0.1919   0.0545  -3.525   0.0007
length             0.4565   0.1457   3.132   0.0023
cabins             0.9506   0.1451   6.551   0.0000

Validation Sample  Estimate  Std Err  t-stat  P-Value
(Intercept)         -0.0970   0.9142  -0.106   0.9159
tonnage              0.0286   0.0124   2.303   0.0252
passengers          -0.1234   0.0582  -2.119   0.0388
length               0.2321   0.1917   1.211   0.2313
cabins               0.7058   0.1060   6.656   0.0000

Coefficients keep their signs, but significance levels change a lot. See Tonnage and Length.
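$MSEP = \frac{1}{n_V}\sum_{i=1}^{n_V}\left(y_{iV} - \hat{y}_{iV}^{T}\right)^2 = 0.7578$

$\text{Bias} = \frac{1}{n_V}\sum_{i=1}^{n_V}\left(y_{iV} - \hat{y}_{iV}^{T}\right) \qquad \text{Bias}^2 = 0.0005182738$

Percent Bias of MSEP $= 100 \times \frac{\text{Bias}^2}{MSEP} = 100 \times \frac{0.0005182738}{0.7578} = 0.06838787$ (%)

Here $y_{iV}$ is the observed crew size for validation case $i$ and $\hat{y}_{iV}^{T}$ is its prediction from the training-sample estimates.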
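The MSEP and bias calculations are simple averages of the validation prediction errors. In the sketch below, `msep_and_bias` is a hypothetical helper and the three-case lists are made-up numbers used only to show the call; the second part reuses the MSEP (0.7578) and squared bias (0.0005182738) reported on the slide.

```python
def msep_and_bias(observed, predicted):
    """Return (MSEP, Bias, percent of MSEP due to squared bias)."""
    errors = [y - y_hat for y, y_hat in zip(observed, predicted)]
    n = len(errors)
    msep = sum(e * e for e in errors) / n     # mean squared error of prediction
    bias = sum(errors) / n                    # mean prediction error
    pct_bias = 100 * bias ** 2 / msep
    return msep, bias, pct_bias

# Hypothetical toy data, only to illustrate the function call.
toy_msep, toy_bias, toy_pct = msep_and_bias([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])

# Slide values: share of MSEP attributable to squared bias.
slide_msep = 0.7578
slide_bias_sq = 0.0005182738
slide_pct = 100 * slide_bias_sq / slide_msep
print(f"Percent bias of MSEP = {slide_pct:.4f} %")
```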
Testing Bias = 0 from Training data to Validation
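$\text{Bias} = -0.02276563 \qquad s = 0.8778456$

$s_{\text{Bias}} = \frac{s}{\sqrt{n_V}} = \frac{0.8778456}{\sqrt{58}} = 0.1152668$

$t = \frac{\text{Bias}}{s_{\text{Bias}}} = \frac{-0.02276563}{0.1152668} = -0.1975$

No evidence of systematic bias when predicting the validation sample from the training-sample fit.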
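The bias test is a one-sample t statistic on the validation prediction errors: the mean error divided by its standard error, s divided by the square root of the validation sample size. The sketch below recomputes both quantities from the mean (-0.02276563), standard deviation (0.8778456), and validation size (58) given above; `bias_t_statistic` is a hypothetical helper name.

```python
import math

def bias_t_statistic(mean_error, sd_error, n_validation):
    """Standard error of the mean prediction error and the t statistic for Bias = 0."""
    std_error = sd_error / math.sqrt(n_validation)
    return std_error, mean_error / std_error

std_error, t_stat = bias_t_statistic(-0.02276563, 0.8778456, 58)
print(f"standard error = {std_error:.7f}")
print(f"t statistic    = {t_stat:.4f}")
```

A t statistic this close to zero, on 57 degrees of freedom, gives no reason to suspect that the training-sample model systematically over- or under-predicts crew size in the validation sample.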
PRESS Statistic
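$\hat{Y}_{i(i)} = \hat{\beta}_{0(i)} + \hat{\beta}_{1(i)} X_{i1} + \cdots + \hat{\beta}_{p(i)} X_{ip}$, where the regression was fit without case $i$

$PRESS = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_{i(i)}\right)^2$

$MS_{PRESS} = \frac{PRESS}{n}$   Compare with $MS_{\text{Residual}}$ for the full model

Note: $Y_i - \hat{Y}_{i(i)} = \frac{Y_i - \hat{Y}_i}{1 - p_{ii}}$, where $p_{ii}$ = $i$th diagonal element of $\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$

$PRESS/n = 0.9801 \qquad MS_{\text{Residual}} = 0.96$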
Model appears to be valid, very little difference between PRESS/n and MS(Resid)
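The leverage identity means PRESS never requires n separate refits: each deleted residual equals the ordinary residual divided by (1 - p_ii). The sketch below checks this on a one-predictor regression, where p_ii = 1/n + (x_i - xbar)^2 / Sxx. The data are hypothetical and the function names are illustrative; the cruise-ship model itself has four predictors and would need a matrix routine.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope

def press_by_refitting(x, y):
    """PRESS computed literally: refit with each case deleted in turn."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

def press_by_leverage(x, y):
    """PRESS from one fit, using deleted residual = e_i / (1 - p_ii)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        p_ii = 1 / n + (xi - x_bar) ** 2 / sxx   # hat-matrix diagonal, one predictor
        total += ((yi - (b0 + b1 * xi)) / (1 - p_ii)) ** 2
    return total

# Hypothetical data, chosen only to exercise both routes.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
print(press_by_refitting(x, y), press_by_leverage(x, y))
```

Both routes give the same PRESS up to floating-point error, which is why PRESS is cheap for OLS while K-fold refitting is reserved for models without such a shortcut.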