Running head: LINEAR REGRESSION ANALYSIS PROJECT 1001399967
LINEAR REGRESSION ANALYSIS PROJECT
NISHIL SHETH
1001399967
INTRODUCTION TO STATISTICS (IE 5317)
UNIVERSITY OF TEXAS AT ARLINGTON
“I _______________________ did not give or receive any assistance on this project, and the report
submitted is wholly my own”
NISHIL SHETH (Nov 30, 2016)
NISHIL SHETH
ABSTRACT
A simple linear regression analysis is conducted on a set of linearly trending data to determine how
much of the variability in the observations the fitted line explains. The data chosen are the mileage
of heavy trucks ($x_i$) and the corresponding fuel consumption ($y_i$) for the years 1949 to 2010.
The data are analyzed with the regression tool in Microsoft Excel to test whether fuel consumption
can be predicted from a given mileage. The data are first separated into training data and test data.
The regression is fitted on the training data, and the fitted line is used to predict fuel consumption
for a mileage value within the training data range. A hypothesis test on the slope parameter checks
that the relationship is statistically significant. The fitted line is then applied to the test data.
The errors of prediction and the means of the absolute and relative errors are calculated, and
conclusions are drawn from the results.
DATA
Data collection is the most important step in a simple linear regression analysis. The data consist
of an independent variable ($x_i$) and a dependent variable ($y_i$). A scatter plot of the two
variables shows whether a linear regression line is a reasonable model.
Figure 1. Scatterplot of Mileage of heavy truck vs Fuel consumption
The data were collected by finding a relationship with at least 50 observations of continuous
dependent and independent variables (U.S Department of Transportation, 2012). Here, the
independent variable is the mileage of heavy trucks ($x_i$) and the dependent variable is the fuel
consumption ($y_i$). The value of $R^2$ gives the proportion of the variability in $y_i$ that the
model explains. The $R^2$ value must not be too small, as a small value may affect the hypothesis
testing of the model. The data chosen for the linear regression analysis are 62 observations of
mileage of heavy trucks ($x_i$) vs fuel consumption ($y_i$), shown in the scatterplot above.
A random number is assigned to each observation using the RAND() function in Excel. The random
values are copied and pasted back as values only, so that they no longer recalculate, and the
observations are sorted by random number from smallest to largest. The training data are 80 percent
of the 62 observations, which rounds to 50, and the remaining 12 observations (about 20 percent)
are the test data. The regression analysis is carried out on the 50 training observations using the
Data -> Data Analysis tool, and the prediction analysis is carried out on the test data. The
arranged training data are shown in the Appendix.
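The same split can be sketched outside Excel. The following is a minimal Python illustration, not the procedure used in the project: the function name `split_train_test`, the fixed seed and the use of the years as stand-in rows are all assumptions made for the example.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Attach a random key to each row, sort by key, and split (hypothetical helper)."""
    rng = random.Random(seed)                       # fixed seed so the split is repeatable
    keyed = [(rng.random(), row) for row in rows]   # one random key per row, like RAND()
    keyed.sort(key=lambda pair: pair[0])            # sort from smallest to largest key
    ordered = [row for _, row in keyed]
    n_train = round(train_fraction * len(rows))     # 0.8 * 62 = 49.6, rounded to 50
    return ordered[:n_train], ordered[n_train:]


years = list(range(1949, 2011))                     # stand-in rows: 62 yearly observations
train, test = split_train_test(years)
print(len(train), len(test))
```

Because the keys are random, the rows that land in each set depend on the seed; only the 50/12 sizes are fixed.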
LINEAR REGRESSION ANALYSIS PROJECT 1001399967
5
Simple Linear Regression Analysis
The input X and Y ranges of the training data set are entered in the Data Analysis -> Regression
tool. The default confidence level is 95%. The fit plot, residual plot and normal probability plot
options are ticked. The generated output of the analysis is shown in Table 1. The equation of the
fitted regression line is $\hat{y} = 0.178x - 271.48$.
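For readers without Excel, the least-squares estimates can be written in closed form as $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$. The sketch below assumes that textbook formulation; the function name `fit_line` is hypothetical, and the four points are synthetic values placed exactly on the reported line, not the project data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)                        # spread of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # co-variation of x and y
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Synthetic points lying exactly on y = 0.178x - 271.48 (illustration only).
xs = [10000, 15000, 20000, 25000]
ys = [0.178 * x - 271.48 for x in xs]
b0, b1 = fit_line(xs, ys)
print(round(b1, 3), round(b0, 2))
```

Since the points lie on the line, the fit simply recovers the slope and intercept that generated them.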
Table 1. Excel output for the training dataset

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.973630211
R Square             0.947955788
Adjusted R Square    0.946871534
Standard Error       274.3523783
Observations         50

ANOVA
             df   SS            MS            F           Significance F
Regression   1    65807340.7    65807340.7    874.29276   1.83285E-32
Residual     48   3612922.919   75269.22748
Total        49   69420263.62

                         Coefficients   Standard Error   t Stat         P-value     Lower 95%      Upper 95%     Lower 95.0%    Upper 95.0%
Intercept                -271.4825557   118.9314621      -2.282680721   0.0269201   -510.6102872   -32.3548242   -510.610287    -32.35482424
Mileage of Heavy truck   0.17802848     0.006020895      29.56844198    1.833E-32   0.16592266     0.1901343     0.16592266     0.190134301

Interpretation of the slope: If the mileage of the heavy truck increases by one mile per vehicle,
we predict that the fuel consumption increases by approximately 0.178 gallons per vehicle.
Figure 2. Fitted line plot overlaid on the scatterplot
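Interpretation of the intercept: If the mileage of the heavy truck is zero miles per vehicle, we
predict that the fuel consumption is -271.48 gallons per vehicle. A negative consumption has no
physical meaning; zero mileage lies far outside the observed x-range, so the intercept should not
be read literally.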
F-test of the training data: The hypotheses for the slope parameter are
$H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$. From the ANOVA table of the training data, the F
statistic is 874.29 and its Significance F, which is the p-value of the test, is 1.83E-32.
P-value of the F-test: The value of the F statistic is 874.2927609, so the p-value is < 0.00001.
The p-value is less than the significance level 0.05. Hence, we reject $H_0$. So, the slope
parameter cannot be zero.
Conclusion: This is a strong conclusion that the slope parameter of the given dataset is not zero.
We have statistically significant evidence at $\alpha = 0.05$ that there is a relation between
mileage and fuel consumption of heavy trucks. This supports using the fitted line to predict fuel
consumption from mileage.
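The F statistic can be cross-checked from the sums of squares in the Excel output. The sketch below copies its inputs from the ANOVA and coefficient tables; the variable names are illustrative. It also shows that, with a single predictor, the squared t statistic of the slope is the same quantity as F, up to rounding of the tabulated inputs.

```python
# Values copied from the Excel ANOVA and coefficient tables (Table 1).
ss_regression, df_regression = 65807340.7, 1
ss_residual, df_residual = 3612922.919, 48
slope, slope_se = 0.17802848, 0.006020895

ms_regression = ss_regression / df_regression   # mean square for regression
ms_residual = ss_residual / df_residual         # mean squared error (MSE)
f_stat = ms_regression / ms_residual            # F = MSR / MSE
t_stat = slope / slope_se                       # t statistic of the slope

print(round(f_stat, 2))
print(round(t_stat ** 2, 2))
```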
Interpretation of the $R^2$ value: From the scatter plot of the data plotted for all the
observations, the $R^2$ value is 0.9468. This value is the proportion of the variance in the data
that the model accounts for: 94.68% of the variability of fuel consumption is explained by the
linear model with mileage as the independent variable. For the training data alone, Table 1
reports an R Square of 0.9480.
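The R Square and Adjusted R Square entries of the Excel output follow directly from the ANOVA sums of squares. A minimal sketch, assuming the standard definitions $R^2 = 1 - SSE/SST$ and adjusted $R^2 = 1 - (SSE/(n-2))/(SST/(n-1))$ for one predictor; the variable names are illustrative.

```python
# Sums of squares copied from the ANOVA table (Table 1).
ss_residual = 3612922.919    # SSE, 48 degrees of freedom
ss_total = 69420263.62       # SST, 49 degrees of freedom
n = 50                       # training observations

r_squared = 1 - ss_residual / ss_total
adjusted_r_squared = 1 - (ss_residual / (n - 2)) / (ss_total / (n - 1))

print(round(r_squared, 4), round(adjusted_r_squared, 4))
```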
Residual Plot: The vertical axis of the residual plot shows the residuals, and the horizontal axis
shows the independent variable, i.e. mileage of heavy truck for the different years. The data
points are randomly dispersed around the horizontal axis, which indicates that the linear
regression model is appropriate.
Figure 3. Residual Plot of the data
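Each point in the residual plot is the observed fuel consumption minus the fitted value. The sketch below recomputes the residuals for the first three training observations listed in the Appendix, using the coefficients from the Excel output; the function name `residual` is hypothetical.

```python
B0 = -271.4825557    # intercept from the Excel output
B1 = 0.17802848      # slope from the Excel output


def residual(x, y, b0=B0, b1=B1):
    """Observed value minus the value on the fitted line."""
    return y - (b0 + b1 * x)


# (mileage, fuel consumption) for observations 1 to 3 of the training data.
rows = [(26602, 4477), (10537, 1341), (18736, 3447)]
for x, y in rows:
    print(x, round(residual(x, y), 2))
```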
Confidence interval and prediction interval:
Take a value $x_0$ within the x-range of the training data, here $x_0 = 10000$ miles per vehicle.
The predicted value from the fitted line is $\hat{y}_0 = 0.178(10000) - 271.48 = 1508.52$.
The two-sided confidence interval for the mean response $E[Y \mid x = x_0]$ is
$\hat{y}_0 \pm t_{\alpha/2,\, n-2} \cdot s.e\{\hat{y} \mid x = x_0\}$.
The standard error of the slope is $s.e\{b_1\} = \sqrt{MSE} / \sqrt{S_{xx}}$.
The value of MSE (mean squared error) from the ANOVA table is 75269.2274792751 and the value of
$S_{xx} = \sum x_i^2 - n\bar{x}^2 = 2076325186$, so $s.e\{b_1\} = 0.00602$, which agrees with the
standard error of the slope in Table 1.
The standard error of the fitted value is
$s.e\{\hat{y} \mid x = x_0\} = \sqrt{MSE \left( 1/n + (x_0 - \bar{x})^2 / S_{xx} \right)} = 65.0528$,
and $t_{\alpha/2,\, n-2} = t_{0.025,\, 48} = 2.011$.
So, the confidence interval is (1377.69881, 1639.3411). Thus, we are 95% confident that the mean
fuel consumption at $x_0 = 10000$ lies in this interval.
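The interval arithmetic can be reproduced from the reported quantities. In the sketch below, the standard error of the fitted value (65.0528) is taken as reported, because the training mean of the mileage is not stated in the text; the helper `mean_response_se` is a hypothetical function showing the textbook formula for when that mean is available.

```python
import math

mse = 75269.2274792751     # mean squared error from the ANOVA table
sxx = 2076325186           # S_xx as reported
se_slope = math.sqrt(mse / sxx)


def mean_response_se(mse, n, x0, x_bar, sxx):
    """Standard error of the fitted mean at x0 (needs the sample mean x_bar)."""
    return math.sqrt(mse * (1.0 / n + (x0 - x_bar) ** 2 / sxx))


y_hat = 0.178 * 10000 - 271.48   # predicted value at x0 = 10000
se_fit = 65.0528                 # reported standard error of the fitted value
t_crit = 2.011                   # reported t value for 48 degrees of freedom

ci = (y_hat - t_crit * se_fit, y_hat + t_crit * se_fit)
print(round(se_slope, 6))
print(round(ci[0], 2), round(ci[1], 2))
```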
Prediction interval: The prediction error is
$p.e\{\hat{y} \mid x = x_0\} = \sqrt{MSE + (s.e\{\hat{y} \mid x = x_0\})^2} = 281.959$.
So, the prediction interval is $\hat{y}_0 \pm t_{\alpha/2,\, n-2} \cdot p.e\{\hat{y} \mid x = x_0\}$.
Therefore, the prediction interval for the given value of $x_0$ is (941.4996, 2075.54). We are 95%
confident that a new observation of fuel consumption at $x_0$ lies in this interval. The
prediction interval is always wider than the confidence interval.
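The prediction interval can be checked the same way. This sketch uses only numbers reported in the text (MSE, the standard error of the fitted value, the t value and the predicted value at a mileage of 10000); the variable names are illustrative.

```python
import math

mse = 75269.2274792751   # mean squared error from the ANOVA table
se_fit = 65.0528         # reported standard error of the fitted value
t_crit = 2.011           # reported t value for 48 degrees of freedom
y_hat = 1508.52          # reported predicted value

# A new observation adds the error variance (MSE) to the variance of the fitted mean.
pred_err = math.sqrt(mse + se_fit ** 2)
pi = (y_hat - t_crit * pred_err, y_hat + t_crit * pred_err)

print(round(pred_err, 3))
print(round(pi[0], 2), round(pi[1], 2))
```

The extra MSE term under the square root is why this interval is wider than the confidence interval for the mean.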
Prediction Analysis and Conclusions
The test data with the corresponding predicted values are shown in Table 2.
Table 2. Test data set with corresponding predicted values
The table shows, for each test observation, the year, the mileage of heavy truck (x_i), the fuel
consumption (y_i), the random number used for sorting, the predicted value from the fitted line
(Predicted), the error of prediction (P.E = y_i - predicted value), the absolute error of
prediction (A.P.E = absolute value of P.E), the relative error of prediction (R.E = P.E/(absolute
value of y_i)) and the absolute relative error of prediction (A.R.E).

Year   x_i     y_i    Random    Predicted   P.E         A.P.E       R.E        A.R.E
1960   26609   4174   0.85866   4465.66     -291.6641   291.66405   -0.06988   0.069876
1976   15167   2722   0.87599   2428.67     293.3323    293.33232   0.107764   0.107764
1957   10682   1281   0.87981   1630.21     -349.2121   349.2121    -0.27261   0.272609
1995   10963   1283   0.89824   1680.24     -397.238    397.23796   -0.30962   0.309616
1986   22143   3821   0.90461   3670.59     150.409     150.409     0.039364   0.039364
1993   10769   1288   0.91498   1645.7      -357.7005   357.70053   -0.27772   0.277718
1994   10395   1380   0.93487   1579.12     -199.1181   199.11806   -0.14429   0.144288
2003   19931   3647   0.96242   3276.79     370.2069    370.20693   0.10151    0.10151
1968   28573   4387   0.97485   4815.31     -428.311    428.31104   -0.09763   0.097632
1951   10545   1242   0.9843    1605.82     -363.8223   363.82226   -0.29293   0.292933
1977   27023   4057   0.98496   4539.37     -482.3676   482.36764   -0.1189    0.118898
1966   26014   4352   0.99553   4359.74     -7.737392   7.737392    -0.00178   0.001778
MAPE = 307.59327   MARE = 0.152832

MAPE (Mean Absolute Error of Prediction) value from the test data is 307.59327. MAPE measures
how far the predicted values are from the observed values on average, in the same units as the
data. Here the model describes fuel consumption as a function of the mileage of heavy trucks over
the years. The MAPE value is the typical error in a predicted value, and it is quite high: a
predicted fuel consumption deviates from the observed value by 307.593 units on average, which
weakens the model.
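The mean absolute error can be recomputed from the test observations and the fitted line. The sketch below uses the unrounded coefficients from the Excel output, so individual errors differ from the tabulated ones in the second decimal place; the variable names are illustrative.

```python
B0, B1 = -271.4825557, 0.17802848   # coefficients from the Excel output

# (mileage, fuel consumption) for the 12 test observations.
test_rows = [
    (26609, 4174), (15167, 2722), (10682, 1281), (10963, 1283),
    (22143, 3821), (10769, 1288), (10395, 1380), (19931, 3647),
    (28573, 4387), (10545, 1242), (27023, 4057), (26014, 4352),
]

# Error of prediction: observed value minus value on the fitted line.
errors = [y - (B0 + B1 * x) for x, y in test_rows]
mape = sum(abs(e) for e in errors) / len(errors)   # mean of the absolute errors

print(round(mape, 1))
```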
MARE (Mean Absolute Relative Error of Prediction) value from the test data prediction analysis
is 0.15283. MARE measures the predictive power of the model: it is the average of the absolute
relative errors of prediction of y over the observations in the test data. The lower the value of
MARE, the higher the predictive power. Here the MARE is 15.28%, which is quite large: on average,
a predicted fuel consumption differs from the observed value by about 15% of that value, so the
model explains the variability of the test observations less accurately. (Thomas Bruckmann, 2014)
This implies that the prediction model (mileage vs fuel consumption) is likely to be less accurate
because of the high values of MAPE and MARE. Predictions of fuel consumption from the mileage of
heavy trucks should be made with these errors in mind.
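The relative-error average can be recomputed from two columns of the test data: the observed fuel consumption and the error of prediction. The sketch below copies those columns as tabulated; the variable names are illustrative.

```python
# Observed fuel consumption for the 12 test observations.
observed = [4174, 2722, 1281, 1283, 3821, 1288, 1380, 3647, 4387, 1242, 4057, 4352]

# Tabulated errors of prediction (observed minus predicted), in the same order.
prediction_errors = [
    -291.6641, 293.3323, -349.2121, -397.238, 150.409, -357.7005,
    -199.1181, 370.2069, -428.311, -363.8223, -482.3676, -7.737392,
]

# Relative error: error of prediction divided by the absolute observed value.
relative_errors = [pe / abs(y) for pe, y in zip(prediction_errors, observed)]
mare = sum(abs(r) for r in relative_errors) / len(relative_errors)

print(round(mare, 4))
```

The largest relative errors come from the observations with low fuel consumption, where an error of a few hundred units is a large share of the observed value.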
General Conclusion:
A prediction model was developed by simple linear regression of fuel consumption on mileage of
heavy trucks for consecutive years. On the training data, the fitted line predicts fuel
consumption well within the x-range of the training data. On the test data, however, the
prediction errors are quite high, which indicates weaker predictive power than the training fit
suggests. Thus, the model cannot be relied on for prediction because of the large errors on the
test data.
APPENDIX
Table 3. Training data with corresponding fitted values and residuals
Observation  Year  Mileage of heavy truck (x_i)  Fuel consumption (y_i)  Predicted fuel consumption (fitted value)  Residual
1 1970 26602 4477 4464.431077 12.56892
2 1965 10537 1341 1604.403541 -263.404
3 1956 18736 3447 3064.059051 382.9409
4 2009 10768 1303 1645.52812 -342.528
5 1979 26092 4221 4373.636552 -152.637
6 1996 20597 3570 3395.370053 174.6299
7 1973 26514 4315 4448.76457 -133.765
8 1955 26235 4385 4399.094624 -14.0946
9 2006 10693 1333 1632.175984 -299.176
10 1963 23603 3953 3930.523664 22.47634
11 1949 15370 2775 2464.815186 310.1848
12 1964 18502 3380 3022.400386 357.5996
13 2008 26274 4037 4406.037735 -369.038
14 2000 14117 2519 2241.7455 277.2545
15 1954 10576 1293 1611.346652 -318.347
16 1967 27071 4642 4547.926434 94.07357
17 1997 13565 2467 2143.473779 323.5262
18 2004 12789 2294 2005.323679 288.6763
19 1987 9712 1080 1457.530045 -377.53
20 1981 19016 3565 3113.907025 451.093
21 1998 25231 4304 4220.35403 83.64597
22 1983 22550 3967 3743.059675 223.9403
23 1990 28093 4215 4729.871541 -514.872
24 1953 13484 2459 2129.053472 329.9465
25 1971 26262 4309 4403.901393 -94.9014
26 1999 10511 1309 1599.7748 -290.775
27 1975 27032 4218 4540.983323 -322.983
28 1991 10316 1229 1565.059247 -336.059
29 2002 25838 4202 4328.417318 -126.417
30 2001 10702 1328 1633.77824 -305.778
31 1985 25617 4391 4289.073024 101.927
32 1972 18045 3263 2941.041371 321.9586
33 1961 21083 3769 3481.891894 287.1081
34 1959 10851 1387 1660.304484 -273.304
35 1962 10408 1389 1581.437867 -192.438
36 1982 23349 3937 3885.30443 51.69557
37 2010 22485 3736 3731.487823 4.512177
38 1969 12402 2240 1936.426657 303.5733
39 1988 22926 3776 3809.998383 -33.9984
40 1980 16700 3002 2701.593065 300.4069
41 1950 12537 2250 1960.460502 289.5395
42 1952 28290 4398 4764.943151 -366.943
43 2005 15438 2764 2476.921123 287.0789
44 1978 25397 4135 4249.906758 -114.907
45 1992 25373 4210 4245.634074 -35.6341
46 2007 10554 1337 1607.430025 -270.43
47 1989 14995 2708 2398.054506 309.9455
48 1958 24229 4047 4041.969493 5.030507
49 1974 10774 1304 1646.596291 -342.596
50 1984 14780 2657 2359.778383 297.2216
REFERENCES
1. Thomas Bruckmann, R. M. (2014, September 15-19). MiL Testing of Highly Configurable Continuous
Controllers: Scalable Search Using Surrogate Models. Retrieved from www.researchgate.net:
https://www.researchgate.net/publication/274567717_MiL_Testing_of_Highly_Configurable_Continuous_Controllers_Scalable_Search_Using_Surrogate_Models
2. U.S Department of Transportation. (2012, September 27). Retrieved from http://www.eia.gov:
http://www.eia.gov/totalenergy/data/annual/showtext.php?t=pTB0208