
Running head: LINEAR REGRESSION ANALYSIS PROJECT 1001399967

LINEAR REGRESSION ANALYSIS PROJECT

NISHIL SHETH

1001399967

INTRODUCTION TO STATISTICS (IE 5317)

UNIVERSITY OF TEXAS AT ARLINGTON

“I _______________________ did not give or receive any assistance on this project, and the report

submitted is wholly my own”

NISHIL SHETH (Nov 30, 2016)

NISHIL SHETH


ABSTRACT

A simple linear regression analysis is conducted on a set of linearly trending data to measure how much of the variability in the observations the fitted line explains. The data chosen are the mileage of heavy trucks for the years 1949 to 2010 (xi) and the fuel consumption corresponding to the mileage of that year (yi). The data are analyzed with the regression tool in Microsoft Excel to test whether mileage can be used to predict fuel consumption. The data are first separated into training data and test data. The regression is fitted on the training data, and the fitted line is used to predict fuel consumption for a mileage value within the training data range. A hypothesis test on the slope parameter is carried out to confirm that the relationship is statistically significant. The fitted line is then applied to the test data, and the errors of prediction, the mean absolute error and the mean absolute relative error are calculated and used to draw conclusions.


DATA

Data collection is the most important step in a simple regression analysis. The dataset consists of an independent variable (xi) and a dependent variable (yi). Whether a linear regression line is appropriate can be judged from the scatter plot.

Figure 1. Scatterplot of Mileage of heavy truck vs Fuel consumption

The data were collected by finding a dataset of at least 50 observations with continuous dependent and independent variables (U.S Department of Transportation, 2012). Here, the independent variable is the mileage of heavy trucks (xi) and the dependent variable is the fuel consumption (yi). The value of R2 gives the proportion of the variability in yi that the model explains. R2 must not be too small, because a small value may affect the hypothesis testing of the model. The dataset chosen for the linear regression analysis has 62 observations of mileage of heavy trucks (xi) vs fuel consumption (yi), shown in the scatterplot above.

A random number is assigned to each observation using the RAND() function in Excel. The random values are copied and pasted back as values only, so that they no longer recalculate, and the observations are sorted by random number from smallest to largest. The training data are about 80 percent of the 62 observations, i.e. 50, and the remaining 12 observations (about 20 percent) are the test data. The arranged training data are shown in the Appendix. The regression is carried out on the 50 training observations using Data -> Data Analysis, and the prediction analysis is carried out on the test data.
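The split procedure described above can be sketched in Python. This is only an illustration of the same idea (attach a random key to each row, sort by it, keep the first 80 percent for training), not the spreadsheet steps themselves. The function name `split_train_test` and the `seed` argument are assumptions introduced for the example, and the years stand in for the full data rows.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=None):
    """Shuffle rows by a random key (like a RAND() column) and split them."""
    rng = random.Random(seed)
    # Attach one random number per row, then sort by it (smallest to largest).
    keyed = sorted(((rng.random(), row) for row in rows), key=lambda p: p[0])
    ordered = [row for _, row in keyed]
    n_train = round(train_fraction * len(rows))  # round(0.8 * 62) = 50
    return ordered[:n_train], ordered[n_train:]


# Hypothetical stand-in for the 62 yearly observations (1949 to 2010).
years = list(range(1949, 2011))
train, test = split_train_test(years, seed=0)
print(len(train), len(test))
```

With 62 rows the training part has 50 rows and the test part has 12, matching the counts used in the report. A different seed gives a different, equally valid split.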


Simple Linear Regression Analysis

The X and Y input ranges of the training data set are entered in the Data Analysis -> Regression tool. The default confidence level is 95%. The line fit plot, residual plot and normal probability plot options are ticked. The generated output of the analysis is shown in Table 1. The equation of the fitted regression line is y = 0.178x - 271.48.

Table 1. Excel Output of the Training dataset

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.973630211
R Square             0.947955788
Adjusted R Square    0.946871534
Standard Error       274.3523783
Observations         50

ANOVA
              df    SS             MS             F            Significance F
Regression     1    65807340.7     65807340.7     874.29276    1.83285E-32
Residual      48    3612922.919    75269.22748
Total         49    69420263.62

                          Coefficients    Standard Error    t Stat          P-value      Lower 95%       Upper 95%
Intercept                 -271.4825557    118.9314621       -2.282680721    0.0269201    -510.6102872    -32.3548242
Mileage of Heavy truck    0.17802848      0.006020895       29.56844198     1.833E-32    0.16592266      0.1901343

Interpretation of the slope: If the mileage of the heavy truck increases by one mile per vehicle, we predict that the fuel consumption increases by approximately 0.178 gallons per vehicle.
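As a cross-check on the spreadsheet output, the least-squares fit can be reproduced with the closed-form formulas b1 = Sxy / Sxx and b0 = ȳ - b1·x̄. The sketch below is an illustration, not the Excel procedure used in the report. It assumes the 50 (mileage, fuel consumption) pairs are the ones listed in the Appendix (Table 3), and the names `TRAIN` and `fit_simple_ols` are introduced only for this example.

```python
# Training pairs (mileage x_i, fuel consumption y_i), copied from Table 3.
TRAIN = [
    (26602, 4477), (10537, 1341), (18736, 3447), (10768, 1303), (26092, 4221),
    (20597, 3570), (26514, 4315), (26235, 4385), (10693, 1333), (23603, 3953),
    (15370, 2775), (18502, 3380), (26274, 4037), (14117, 2519), (10576, 1293),
    (27071, 4642), (13565, 2467), (12789, 2294), (9712, 1080), (19016, 3565),
    (25231, 4304), (22550, 3967), (28093, 4215), (13484, 2459), (26262, 4309),
    (10511, 1309), (27032, 4218), (10316, 1229), (25838, 4202), (10702, 1328),
    (25617, 4391), (18045, 3263), (21083, 3769), (10851, 1387), (10408, 1389),
    (23349, 3937), (22485, 3736), (12402, 2240), (22926, 3776), (16700, 3002),
    (12537, 2250), (28290, 4398), (15438, 2764), (25397, 4135), (25373, 4210),
    (10554, 1337), (14995, 2708), (24229, 4047), (10774, 1304), (14780, 2657),
]


def fit_simple_ols(pairs):
    """Return (intercept, slope, r_squared) of the least-squares line."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    sst = sum((y - y_bar) ** 2 for _, y in pairs)
    return intercept, slope, 1.0 - sse / sst


b0, b1, r2 = fit_simple_ols(TRAIN)
print(f"intercept={b0:.4f} slope={b1:.6f} R^2={r2:.6f}")
```

The printed intercept, slope and R Square should agree with the corresponding entries of Table 1 up to rounding.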

Figure 2. Fitted line plot overlaid on the scatterplot

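Interpretation of the intercept: If the mileage of the heavy truck is zero miles per vehicle, the fitted line predicts a fuel consumption of -271.48 gallons per vehicle. A negative consumption is not physically meaningful, and x = 0 lies far outside the observed mileage range, so the intercept only positions the line and should not be read as a prediction.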

F-test of the training data: The hypotheses for the slope parameter are H0: β1 = 0 vs H1: β1 ≠ 0. From the ANOVA table of the training data, the F statistic is 874.2927609 and its p-value (Significance F) is 1.83E-32.

P-value of the F-test: The p-value is far below 0.00001 and therefore less than the significance level 0.05. Hence, we reject H0 and conclude that the slope parameter is not zero.

Conclusion: We have statistically significant evidence at α = 0.05 that there is a linear relationship between the mileage and the fuel consumption of heavy trucks. This supports using the fitted line to predict fuel consumption from mileage.

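Interpretation of R2 value from the data: From the scatter plot of the data plotted for all the datasets, the R2 value is 0.9468. R2 is the proportion of the variability in the dependent variable that the model explains: 94.68% of the variability of fuel consumption is explained by the linear model with mileage of heavy truck as the independent variable. For the training data alone, the Excel output in Table 1 reports R Square = 0.9480 and Adjusted R Square = 0.9469.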
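The F statistic and the R Square values can be recomputed from the sums of squares in the ANOVA part of Table 1. The sketch below uses only numbers printed in that table. It relies on two standard facts for simple linear regression: F equals the square of the slope's t statistic, and the 5% critical value of F with (1, 48) degrees of freedom equals the square of the two-sided t critical value, taken here as 2.011. Variable names are illustrative.

```python
# Entries copied from the ANOVA and coefficient rows of Table 1.
ssr = 65807340.7        # regression sum of squares
sse = 3612922.919       # residual sum of squares
df_reg, df_res = 1, 48
n = 50
t_slope = 29.56844198   # t statistic of the slope
t_crit = 2.011          # t_{0.025, 48}

msr = ssr / df_reg
mse = sse / df_res
f_stat = msr / mse                      # F = MSR / MSE
r2 = ssr / (ssr + sse)                  # R^2 = SSR / SST
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
f_crit = t_crit ** 2                    # F_{0.05; 1, 48} = t_{0.025, 48}^2

print(f"F = {f_stat:.4f}, t^2 = {t_slope ** 2:.4f}, F critical = {f_crit:.3f}")
print(f"R^2 = {r2:.6f}, adjusted R^2 = {adj_r2:.6f}")
print("reject H0" if f_stat > f_crit else "fail to reject H0")
```

Because F is far larger than the critical value, the script reaches the same decision as the p-value argument: H0 is rejected.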

Residual Plot: The vertical axis of the residual plot shows the residuals, and the horizontal axis shows the independent variable, i.e. mileage of heavy truck for the different years. The data points are scattered (randomly dispersed) around the horizontal axis, which indicates that the linear regression model is appropriate.

Figure 3. Residual Plot of the data

Confidence interval and prediction interval:

Take a value x0 within the x-range of the training data, here x0 = 10000 miles per vehicle. The predicted value from the equation of the fitted line is ŷ = 0.178(10000) - 271.48 = 1508.52 gallons per vehicle.

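The two-sided confidence interval for the mean response E[Y | x = x0] is given by

$\hat{y} \mid x = x_0 \;\pm\; t_{\alpha/2,\,n-2}\;\mathrm{s.e.}\{\hat{y} \mid x = x_0\}$

Standard error of the slope:

$\mathrm{s.e.}\{b_1\} = \dfrac{\sqrt{MSE}}{\sqrt{S_{xx}}}$

The value of MSE (mean squared error) from the ANOVA table is 75269.2274792751 and $S_{xx} = \sum x_i^2 - n\bar{x}^2 = 2076325186$, which gives s.e.{b1} = 0.006021, in agreement with Table 1. The standard error of the fitted value at x0 is

$\mathrm{s.e.}\{\hat{y} \mid x = x_0\} = \sqrt{MSE\left(\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}\right)} = 65.0528$

and $t_{\alpha/2,\,n-2} = t_{0.025,\,48} = 2.011$.

So, the confidence interval is (1377.69881, 1639.3411). Thus, we are 95% confident that the mean fuel consumption at x0 = 10000 lies in this interval.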

Prediction interval: The prediction error at x0 is

$\mathrm{p.e.}\{\hat{y} \mid x = x_0\} = \sqrt{MSE + \left(\mathrm{s.e.}\{\hat{y} \mid x = x_0\}\right)^2} = 281.959$

So, the prediction interval is

$\hat{y} \mid x = x_0 \;\pm\; t_{\alpha/2,\,n-2}\;\mathrm{p.e.}\{\hat{y} \mid x = x_0\}$

Therefore, the prediction interval for the given value of x0 is (941.4996, 2075.54). We are 95% confident that a new observation of fuel consumption at x0 lies in this interval. The prediction interval is always wider than the confidence interval, because it also accounts for the variability of an individual observation around the mean.
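The interval arithmetic can be retraced with a few lines of Python. The sketch takes the reported quantities as inputs (MSE, Sxx, the t critical value 2.011, the rounded fitted line, and the standard error 65.0528 of the fitted value at x0 = 10000) and recomputes the slope standard error, the prediction error and both intervals. The helper name `interval` is introduced only for this example.

```python
import math

# Reported quantities from the regression output and the text.
MSE = 75269.2274792751
SXX = 2076325186.0
T_CRIT = 2.011              # t_{0.025, 48}
B0, B1 = -271.48, 0.178     # rounded fitted line used in the report
SE_FIT_X0 = 65.0528         # reported s.e. of the fitted value at x0


def interval(center, half_width):
    """Return the (lower, upper) bounds of a symmetric interval."""
    return center - half_width, center + half_width


x0 = 10000
y_hat = B0 + B1 * x0                           # point prediction at x0
se_b1 = math.sqrt(MSE) / math.sqrt(SXX)        # standard error of the slope
pred_err = math.sqrt(MSE + SE_FIT_X0 ** 2)     # prediction error at x0

ci = interval(y_hat, T_CRIT * SE_FIT_X0)       # interval for the mean response
pi = interval(y_hat, T_CRIT * pred_err)        # interval for a new observation

print(f"y_hat = {y_hat:.2f}, s.e.(b1) = {se_b1:.6f}, p.e. = {pred_err:.3f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is wider because its half-width uses the prediction error, which adds MSE under the square root.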


Prediction Analysis and Conclusions

The test data with the corresponding values is shown in Table 2.

Table 2. Test data set with corresponding predicted values

Year   Mileage of    Fuel           Random    Predicted    Error of        Absolute error   Relative error   Absolute relative
       Heavy truck   Consumption              value (ŷ)    prediction      of prediction    of prediction    error of prediction
       (xi)          (yi)                                  P.E = yi - ŷ    (A.P.E)          (R.E)            (A.R.E)
1960   26609         4174           0.85866   4465.66      -291.6641       291.66405        -0.06988         0.069876
1976   15167         2722           0.87599   2428.67      293.3323        293.33232        0.107764         0.107764
1957   10682         1281           0.87981   1630.21      -349.2121       349.2121         -0.27261         0.272609
1995   10963         1283           0.89824   1680.24      -397.238        397.23796        -0.30962         0.309616
1986   22143         3821           0.90461   3670.59      150.409         150.409          0.039364         0.039364
1993   10769         1288           0.91498   1645.7       -357.7005       357.70053        -0.27772         0.277718
1994   10395         1380           0.93487   1579.12      -199.1181       199.11806        -0.14429         0.144288
2003   19931         3647           0.96242   3276.79      370.2069        370.20693        0.10151          0.10151
1968   28573         4387           0.97485   4815.31      -428.311        428.31104        -0.09763         0.097632
1951   10545         1242           0.9843    1605.82      -363.8223       363.82226        -0.29293         0.292933
1977   27023         4057           0.98496   4539.37      -482.3676       482.36764        -0.1189          0.118898
1966   26014         4352           0.99553   4359.74      -7.737392       7.737392         -0.00178         0.001778

MAPE = 307.59327    MARE = 0.152832

The table shows the predicted values from the fitted line (ŷ) for the xi of the test data, the error of prediction (P.E = yi - ŷ), the absolute error of prediction (A.P.E = |P.E|), the relative error of prediction (R.E = P.E / |yi|), and the absolute relative error of prediction (A.R.E = |R.E|).

MAPE (Mean Absolute Error of Prediction) value from the test data is 307.59327. MAPE measures how far the predicted values are from the observed values, in the same units as the data. Here the model describes fuel consumption as a function of the mileage of heavy trucks over the years. On average, the predicted fuel consumption deviates from the observed value by 307.593 gallons per vehicle, which is quite high. A prediction of fuel consumption for a given mileage may therefore carry a large error, which weakens the model.

MARE (Mean Absolute Relative Error of Prediction) value from the test data prediction analysis is 0.15283. MARE measures the predictive power of the model using the observations from the test data: it is the average of the absolute relative errors of prediction over those observations. The lower the value of MARE, the higher the predictive power. Here the value of MARE in percentage is 15.28%, which is quite large, indicating that the predicted fuel consumption for the mileage values in the test data differs from the observed value by about 15% on average (Thomas Bruckmann, 2014).

This implies that the prediction model for the data of the linear regression analysis (Mileage vs Fuel consumption) is likely to be less accurate, because both MAPE and MARE are high. Predictions of fuel consumption from the mileage of heavy trucks should be made with these errors in mind.
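The two error measures can be recomputed from the test observations in Table 2. The sketch below is an illustration under one assumption: it predicts with the full-precision coefficients from Table 1, whereas the spreadsheet values in Table 2 appear to come from slightly rounded coefficients, so the last decimals may differ a little. The names `TEST`, `prediction_errors` and `mape_mare` are introduced for this example; MAPE keeps the report's meaning of mean absolute error of prediction.

```python
# Coefficients from Table 1 (intercept, slope).
B0, B1 = -271.4825557, 0.17802848

# Test observations from Table 2: (year, mileage x_i, fuel consumption y_i).
TEST = [
    (1960, 26609, 4174), (1976, 15167, 2722), (1957, 10682, 1281),
    (1995, 10963, 1283), (1986, 22143, 3821), (1993, 10769, 1288),
    (1994, 10395, 1380), (2003, 19931, 3647), (1968, 28573, 4387),
    (1951, 10545, 1242), (1977, 27023, 4057), (1966, 26014, 4352),
]


def prediction_errors(rows, b0, b1):
    """Return the list of errors y_i - y_hat_i for the fitted line."""
    return [y - (b0 + b1 * x) for _, x, y in rows]


def mape_mare(rows, b0, b1):
    """Mean absolute error and mean absolute relative error of prediction."""
    errors = prediction_errors(rows, b0, b1)
    n = len(errors)
    mape = sum(abs(e) for e in errors) / n
    mare = sum(abs(e) / abs(y) for e, (_, _, y) in zip(errors, rows)) / n
    return mape, mare


mape, mare = mape_mare(TEST, B0, B1)
print(f"MAPE = {mape:.3f}, MARE = {mare:.4f} ({100 * mare:.2f}%)")
```

Both values should come out close to the MAPE and MARE reported under Table 2.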

General Conclusion:

A prediction model was developed by simple regression analysis of Mileage vs Fuel consumption of heavy trucks over consecutive years. On the training data, the fitted line explains most of the variability in fuel consumption and the slope is statistically significant, so the model describes fuel consumption well within the x-range of the training data. On the test data, however, the prediction errors are quite high, which reduces the predictive power of the model. Thus, the model should not be relied on for prediction, because of the large prediction errors.


APPENDIX

Table 3. Training data with corresponding fitted values and residuals

Observation   Year   Mileage of    Fuel          Predicted Fuel    Residuals
                     Heavy truck   Consumption   Consumption
                     (xi)          (yi)          (Fitted values)
1             1970   26602         4477          4464.431077       12.56892
2             1965   10537         1341          1604.403541       -263.404
3             1956   18736         3447          3064.059051       382.9409
4             2009   10768         1303          1645.52812        -342.528
5             1979   26092         4221          4373.636552       -152.637
6             1996   20597         3570          3395.370053       174.6299
7             1973   26514         4315          4448.76457        -133.765
8             1955   26235         4385          4399.094624       -14.0946
9             2006   10693         1333          1632.175984       -299.176
10            1963   23603         3953          3930.523664       22.47634
11            1949   15370         2775          2464.815186       310.1848
12            1964   18502         3380          3022.400386       357.5996
13            2008   26274         4037          4406.037735       -369.038
14            2000   14117         2519          2241.7455         277.2545
15            1954   10576         1293          1611.346652       -318.347
16            1967   27071         4642          4547.926434       94.07357
17            1997   13565         2467          2143.473779       323.5262
18            2004   12789         2294          2005.323679       288.6763
19            1987   9712          1080          1457.530045       -377.53
20            1981   19016         3565          3113.907025       451.093
21            1998   25231         4304          4220.35403        83.64597
22            1983   22550         3967          3743.059675       223.9403
23            1990   28093         4215          4729.871541       -514.872
24            1953   13484         2459          2129.053472       329.9465
25            1971   26262         4309          4403.901393       -94.9014
26            1999   10511         1309          1599.7748         -290.775
27            1975   27032         4218          4540.983323       -322.983
28            1991   10316         1229          1565.059247       -336.059
29            2002   25838         4202          4328.417318       -126.417
30            2001   10702         1328          1633.77824        -305.778
31            1985   25617         4391          4289.073024       101.927
32            1972   18045         3263          2941.041371       321.9586
33            1961   21083         3769          3481.891894       287.1081
34            1959   10851         1387          1660.304484       -273.304
35            1962   10408         1389          1581.437867       -192.438
36            1982   23349         3937          3885.30443        51.69557
37            2010   22485         3736          3731.487823       4.512177
38            1969   12402         2240          1936.426657       303.5733
39            1988   22926         3776          3809.998383       -33.9984
40            1980   16700         3002          2701.593065       300.4069
41            1950   12537         2250          1960.460502       289.5395
42            1952   28290         4398          4764.943151       -366.943
43            2005   15438         2764          2476.921123       287.0789
44            1978   25397         4135          4249.906758       -114.907
45            1992   25373         4210          4245.634074       -35.6341
46            2007   10554         1337          1607.430025       -270.43
47            1989   14995         2708          2398.054506       309.9455
48            1958   24229         4047          4041.969493       5.030507
49            1974   10774         1304          1646.596291       -342.596
50            1984   14780         2657          2359.778383       297.2216


REFERENCE

1. Thomas Bruckmann, R. M. (2014, September 15-19). MiL Testing of Highly Configurable Continuous. Retrieved from www.researchgate.net: https://www.researchgate.net/publication/274567717_MiL_Testing_of_Highly_Configurable_Continuous_Controllers_Scalable_Search_Using_Surrogate_Models

2. U.S Department of Transportation. (2012, September 27). Retrieved from http://www.eia.gov: http://www.eia.gov/totalenergy/data/annual/showtext.php?t=pTB0208