Multiple linear regression
Prepared by (Matricola):
Ana Paula Zavadniak (172345)
Jacopo Gelli (172769)
Marco Delprato (172178)
Phuong Nam Nguyen (172558)
Data Analysis and Forecasting | December 19, 2014
Department of Economics & Management
University of Trento
GROUP 8 - MAIN
STUDY ON CONSUMPTION PATTERN OF CIGARETTE
GROUP 8 |MAIN PAGE 1
PREVIEW
The main goal of our project is to forecast the consumption of cigarettes in the 50 States and the
District of Columbia (51 observations in total), based on six variables, namely:
Age (x1): Median age of a person living in a state
HS (x2): Percentage of people over 25 years of age in a state who had completed high school
Income (x3): Per capita personal income for a state (in dollars)
Black (x4): Percentage of black people living in a state
Female (x5): Percentage of females living in a state
Price (x6): Weighted average price (in cents) of a pack of cigarettes in a state
The dependent variable (Y) is Sales: the number of packs of cigarettes sold in a state on a per
capita basis. For the full raw dataset, please turn to the APPENDIX section on page 14.
Our objective is to construct a multiple regression model that resembles the following format:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$
[Diagram: Sales (Y) explained by Age (X1), HS (X2), Income (X3), Black (X4), Female (X5) and Price (X6)]
DESCRIPTION AND DATA ANALYSIS
We start by plotting Sales against each of the six predictors (Age, HS, Income, Black,
Female, Price). The resulting scatterplot (Figure 1) shows that Female and Black appear
skewed. We can also spot several outliers, which may affect the correlations between
our variables.
To get a clearer view of the relationship of Black and Female with Sales, it is advisable
to investigate the existing outliers, apply a logarithmic transformation, and then re-plot
the data to see whether there is any improvement.
Figure 1: Scatterplot of full dataset
Investigating the dataset, we identify the following observations as potential outliers:
AK, DC and NH.
To assess whether these outliers have any influence on the significance of our model, we
first compute the regression on the full dataset (no outliers removed).
If we then compute the regression with all three outliers removed, Adjusted R² rises only
from 0.2282 to 0.2441, which is not enough.
We then run the regression without DC: Adjusted R² falls to 0.1321, even worse than the
first scenario. Although DC contains two outlying values, we do not think it is reasonable to
remove DC completely, since this observation may play some role in our forecasting model.
Instead, we run the regression without AK and NH: Adjusted R² = 0.4016, which is the
best result so far.
R² by nature describes the strength of the linear relationship between the independent
variables Xi and the dependent variable Y. A decrease in R² means that our forecasting model
explains less of the variation in Sales, which contradicts our main goal. Therefore, we leave
only AK and NH out of the forecasting dataset and keep the remaining observations unchanged.
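The scenarios above are compared on Adjusted R², which penalizes R² for the number of predictors. As a minimal sketch in Python (the report itself works in R), the standard formula can be written as follows. The function name and the R² value of 0.50 are illustrative assumptions, not results from the report's data.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Illustrative only: the same hypothetical R^2 of 0.50 with six predictors,
# on 51 observations (full data) and on 49 (AK and NH removed).
full = adjusted_r2(0.50, n_obs=51, n_predictors=6)
reduced = adjusted_r2(0.50, n_obs=49, n_predictors=6)
print(round(full, 4), round(reduced, 4))
```

Because the penalty depends on the number of observations, Adjusted R² values computed on datasets of different sizes are only roughly comparable.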
This is what the scatter plot looks like after omitting the two outliers AK and NH.
Figure 2: Scatter plot of adjusted data
Again, Black still seems to be skewed. To see what is going on inside the bulk of the data, we
now take the logarithm of the predictor Black and fit it, together with the remaining predictors,
against the dependent variable Sales.
Surprisingly, after the transformation the corresponding Adjusted R² drops while the p-value
increases, so we decide not to transform the data, in order to preserve the model's
reliability.
CHOICE OF THE MODEL
With six predictors, it is a good idea to work out which ones are most suitable for the
forecasting model and which ones should be left out. There are several different methods,
based on the value of R², CV, AIC or BIC. Generally, the model with the minimum AIC is
most likely to be the best model for forecasting, since minimizing the AIC also tends to
minimize the CV as the number of observations grows. We choose this method, using the
command [step] and letting R calculate the AIC.
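To make the criterion concrete, here is a minimal Python sketch of the AIC for a linear regression in the form given by Hyndman and Athanasopoulos, AIC = N log(SSE/N) + 2(k + 2), where N is the number of observations and k the number of predictors. The SSE values are hypothetical, and the version used internally by R's [step] may differ by an additive constant, which does not change the ranking of models fitted to the same data.

```python
import math

def aic(sse: float, n_obs: int, n_predictors: int) -> float:
    """AIC = N * log(SSE / N) + 2 * (k + 2), with k predictors."""
    return n_obs * math.log(sse / n_obs) + 2 * (n_predictors + 2)

# Hypothetical SSE values for two candidate models on 49 observations.
candidates = {
    "six predictors": aic(sse=24000.0, n_obs=49, n_predictors=6),
    "four predictors": aic(sse=24500.0, n_obs=49, n_predictors=4),
}

# The preferred model is the one with the smallest AIC.
best = min(candidates, key=candidates.get)
print(best)
```

In this made-up case the smaller model wins: dropping two predictors raises the SSE only slightly, so the penalty term 2(k + 2) dominates.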
As calculated automatically by R, the model with the lowest AIC contains four predictors:
Age, Income, Black and Price. R also generates the values of the intercept (β0)
and the corresponding coefficients (βi). From now on, this regression model will be
referred to as fitAIC.
Our optimal model can now be written as follows:
Sales = 31.5782 + 4.08678(Age) + 0.01701(Income) + 0.55917(Black) - 2.4969(Price)
The figures above suggest that Sales is positively associated with Age, Income and the
percentage of black people in each state, and negatively associated with the retail price
of the product. In particular, holding the other predictors fixed:
The intercept value is 31.5782; on its own it has no meaningful interpretation, but it is an
important part of the model;
A 1-year increase in median age corresponds to an average increase of 4.0867 in Sales;
A 1-dollar increase in income corresponds to an average increase of 0.01701 in Sales;
A 1-percentage-point increase in the share of black people corresponds to an average increase
of 0.55917 in Sales;
On the other hand, a 1-cent rise in Price (which is measured in cents) corresponds to an
average fall of 2.497 in Sales.
MODEL FITTING
Now that we have a forecasting model, we need to check whether all four predictors are
useful or whether we can drop one or several of them. There are 2^4 = 16 possible models in
total, all of which are summarized in the table below (a tick [x] indicates that the variable
is included in the model).
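The count of 16 comes from each of the four predictors being either in or out of the model. A short Python sketch (not the report's R code) that enumerates the candidates, using the same tick convention:

```python
from itertools import combinations

predictors = ["Age", "Income", "Black", "Price"]

# Every subset of the four predictors, from the empty (intercept-only)
# model up to the full model: 2**4 = 16 candidates.
subsets = [
    combo
    for size in range(len(predictors) + 1)
    for combo in combinations(predictors, size)
]

for combo in subsets:
    # [x] marks a variable that is included in the candidate model.
    ticks = " ".join("[x]" if p in combo else "[ ]" for p in predictors)
    print(ticks)
```

Each printed row corresponds to one candidate model. The fit statistics (CV, AIC, Adj. R²) would have to be computed separately for each row.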
The best model, with the lowest CV and AIC and the highest Adj. R², is placed on top. Again, the
model with the four predictors Age, Income, Black and Price is the best option, which confirms
the result of the AIC method illustrated above.
RESIDUALS DIAGNOSTIC
Now that we have selected the regression model, which contains four predictors, we plot
the residuals to check that the assumptions of the model are satisfied (the residuals
must have zero mean and be uncorrelated with each predictor).
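The two conditions in parentheses can also be checked numerically. The Python sketch below fits a one-predictor least-squares line to made-up numbers (not the report's data) and computes the mean of the residuals and their correlation with the predictor. All function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def correlation(a, b):
    """Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

# Made-up numbers, not the report's data.
price = [30.0, 35.0, 38.0, 40.0, 42.0, 45.0]
sales = [140.0, 128.0, 125.0, 118.0, 121.0, 108.0]

intercept, slope = fit_line(price, sales)
residuals = [s - (intercept + slope * p) for p, s in zip(price, sales)]

print(round(mean(residuals), 10))                # zero up to rounding
print(round(correlation(residuals, price), 10))  # zero up to rounding
```

For a least-squares fit with an intercept, both quantities are zero in the sample by construction for the predictors included in the model. The residual plots are therefore mainly useful for spotting patterns such as curvature, and for checking predictors that were left out.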
We can see that there is no pattern, so there is no evidence of a nonlinear relationship.
To be on the safe side, we also plot the residuals against the two predictors previously
eliminated (Female and HS). If any pattern arises, we need to add those two variables
back to the model.
Again, no pattern is spotted, so no further action is needed in this case.
The next step is to assess whether heteroscedasticity occurs by plotting the residuals
against the fitted values.
There is no systematic pattern here, and the variation in the residuals does not change with the
size of the fitted values. Therefore, there is no need to transform our forecast variable Sales.
We continue our analysis by plotting the histogram of the residuals:
In our case, the residuals seem to be slightly positively skewed, although this is probably
due to the Washington DC outlier that we chose not to remove.
Now we look at the Q-Q plot:
The residuals fit the normal distribution closely.
Of course, this is no guarantee that this particular normal distribution is the best distribution for
this data set. Nonetheless, the Q-Q plot is a useful tool to visualize the goodness-of-fit of a data
set to a distribution.
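As a rough illustration of what a normal Q-Q plot compares, the Python sketch below pairs each sorted residual with the quantile of a normal distribution fitted to the sample. The residual values are made up, and the plotting positions (i + 0.5) / n are one common convention, assumed here rather than taken from the report.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each sorted value with the matching normal quantile."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    ordered = sorted(sample)
    # Plotting positions (i + 0.5) / n keep the probabilities inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Made-up residuals, not the report's.
residuals = [-12.0, -7.5, -4.0, -1.0, 0.5, 2.0, 3.5, 6.0, 12.5]
for theo, obs in qq_points(residuals):
    print(f"{theo:8.2f} {obs:8.2f}")
```

If the residuals are close to normal, the two columns are close to each other, so the points fall near the 45-degree line when plotted.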
Finally, in order to assess the accuracy of our model, we pick some arbitrary values of the
predictor variables from both inside and outside the observed range, plug them into
the model, and compare the results with the predicted values given by R.
Internal range: Age = 25; Income = 4000; Black = 15; Price = 40
Sales Yin = 31.57820 + 4.08687*25 + 0.01701*4000 + 0.55917*15 - 2.49690*40 ≈ 110.3183
External range: Age = 35; Income = 5500; Black = 75; Price = 50
Sales Yex = 31.57820 + 4.08687*35 + 0.01701*5500 + 0.55917*75 - 2.49690*50 ≈ 185.2894
From the results above, we conclude that our model is reasonably plausible, since the
values of Y obtained from the corresponding values of X, from both the internal and the external
range, lie within the range predicted by our model.
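The plug-in calculation can be reproduced with a few lines of Python (a sketch, not the report's R code). The coefficients are the rounded values printed in the model equation, so the results differ slightly from the values reported by R, which uses the unrounded estimates.

```python
# Coefficients as printed in the report's model equation (rounded).
COEFFICIENTS = {
    "Intercept": 31.5782,
    "Age": 4.08678,
    "Income": 0.01701,
    "Black": 0.55917,
    "Price": -2.4969,
}

def predict_sales(age, income, black, price):
    """Plug predictor values into the fitted linear equation."""
    c = COEFFICIENTS
    return (c["Intercept"] + c["Age"] * age + c["Income"] * income
            + c["Black"] * black + c["Price"] * price)

inside = predict_sales(age=25, income=4000, black=15, price=40)
outside = predict_sales(age=35, income=5500, black=75, price=50)
print(round(inside, 2), round(outside, 2))
```

Both results agree with the reported 110.3183 and 185.2894 to within a few hundredths, which is consistent with rounding of the coefficients.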
PRACTICAL IMPLICATION
Now let's recall our model:
Sales = 31.5782 + 4.08678(Age) + 0.01701(Income) + 0.55917(Black) - 2.4969(Price)
From the figures above we draw the following inferences:
Sales in the 51 States can be predicted from several predictors: the median age of people
in each state, consumers' income, a demographic factor indicating buyers' race, and the
average price of cigarettes (per pack). Two other variables (the proportion of females and
the percentage of people over 25 who had completed high school) do not seem to have any
clear impact on Sales of cigarettes and were thus excluded from the model.
The three variables Age, Income and Black are positively associated with Sales, while Price has
a negative effect on this dependent variable. This suggests that the Company should target the
suitable customer segment (senior, wealthy, black) or offer discounts (a lower retail price) in
order to boost its Sales.
APPENDIX