Group 8 MAIN 2014-Data Analysis Report

Prepared by (name, matricola [student ID number]):

Ana Paula Zavadniak 172345

Jacopo Gelli 172769

Marco Delprato 172178

Phuong Nam Nguyen 172558

Data Analysis and Forecasting | December 19, 2014

Department of Economics & Management

University of Trento

GROUP 8 - MAIN

STUDY ON CONSUMPTION PATTERN OF CIGARETTE


PREVIEW

The main goal of our project is to forecast the consumption of cigarettes in the 50 States and the District of Columbia (treated as 51 states in total), based on six variables, namely:

Age (x1): Median age of a person living in a state

HS (x2): Percentage of people over 25 years of age in a state who had completed

high school

Income (x3): Per capita personal income for a state (income in dollars)

Black (x4): Percentage of black people living in a state

Female (x5): Percentage of females living in a state

Price (x6): Weighted average price (in cents) of a pack of cigarettes in a state

The dependent variable (Y) is Sales: the number of packs of cigarettes sold in a state on a per capita basis. For the full raw dataset, please turn to the APPENDIX section.

Our objective is to construct a multiple regression model that resembles the following format:

Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6

Variables: Sales (Y); Age (X1); HS (X2); Income (X3); Black (X4); Female (X5); Price (X6)


DESCRIPTION AND DATA ANALYSIS

We start by plotting Sales against the six predictors (Age, HS, Income, Black, Female, Price). The result (Figure 1) shows that Female and Black appear somewhat skewed. We can also spot some outliers in the graph, which may affect the correlation between our variables.

To get a clearer view of the relationship of Black and Female with Sales, it is advisable to investigate the existing outliers, apply a logarithmic transformation, and then re-plot the data to see whether there is any improvement.


Figure 1: Scatterplot of full dataset

As we investigate the dataset, we identify the following observations as potential outliers: AK, DC and NH.


To assess whether these outliers have any influence on the significance of our model, we first compute the regression with the full dataset (no outliers removed), which gives an Adjusted R² of 0.2282.

If we compute the regression after removing all three outliers from our dataset, Adjusted R² rises from 0.2282 to 0.2441, which is only a marginal improvement.

We then run the regression without DC: Adjusted R² falls to 0.1321, even worse than the first scenario. Although DC contains two outlying values, we do not think it is reasonable to remove DC completely, since this observation may play a role in our forecasting model.

Instead, we run the regression without AK and NH: Adjusted R² = 0.4016, which is the best result so far.

R² describes the strength of the linear relationship between the independent variables Xi and the dependent variable Y. A lower R² means that our forecasting model fits the data less well, which contradicts our main goal. Therefore, we leave only AK and NH out of the forecasting dataset and keep the remaining observations unchanged.
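Since the comparison above relies on Adjusted R², a minimal sketch of the standard adjustment may help. The function names are ours, and the example assumes that the regression without AK and NH used 49 observations and all six predictors; the report states only the Adjusted R² values, so the implied plain R² is illustrative.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def r2_from_adjusted(adj_r2, n_obs, n_predictors):
    """Invert the adjustment to recover the plain R^2."""
    return 1.0 - (1.0 - adj_r2) * (n_obs - n_predictors - 1) / (n_obs - 1)


# Assumption: 49 observations (51 minus AK and NH) and six predictors.
implied_r2 = r2_from_adjusted(0.4016, n_obs=49, n_predictors=6)
print(round(implied_r2, 4))
print(round(adjusted_r2(implied_r2, n_obs=49, n_predictors=6), 4))
```

The adjustment penalizes every additional predictor, which is why Adjusted R² rather than plain R² is the fairer figure when comparing models of different size.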

This is what the scatter plot looks like after omitting the two outliers AK and NH.

Figure 2: Scatter plot of adjusted data

Again, Black still seems to be skewed. To see what is going on inside the bulk of the data, we now take the logarithm of the predictor Black and examine it, together with the remaining predictors, against the dependent variable Sales.

Surprisingly, after the transformation the corresponding Adjusted R² drops while the p-value increases, so we decide not to transform the data, in order to preserve the model's reliability.
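To illustrate why a logarithm is a natural first attempt for a skewed predictor, the sketch below compares the sample skewness of a variable before and after the transformation. The numbers are hypothetical and are not the Black percentages from our dataset; the helper name is ours.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed percentages (must be strictly positive for log).
raw = [0.5, 1, 2, 3, 4, 6, 10, 20, 35, 70]
logged = [math.log(v) for v in raw]

print(round(sample_skewness(raw), 2))
print(round(sample_skewness(logged), 2))
```

A smaller skewness after the transformation does not guarantee a better regression fit, which is consistent with what we observed for our data.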

CHOICE OF THE MODEL

With six predictors, it is a good idea to work out which are the most suitable to include in the forecasting model and which should be left out. There are several selection criteria, based on the value of adjusted R², CV, AIC or BIC. Generally, the model with the minimum AIC is likely to be the best model for forecasting, since minimizing the AIC also minimizes the CV as the number of observations gets larger. We choose this method, using the command [step] and letting R calculate the AIC.
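As a rough sketch of what the criterion measures, the function below computes the AIC in the form given by Hyndman and Athanasopoulos, AIC = N log(SSE/N) + 2(k + 2). The SSE values are hypothetical, and R's own routines may differ from this form by an additive constant, which does not change the ranking of models fitted to the same data.

```python
import math


def aic(n_obs, sse, n_predictors):
    """AIC = N * log(SSE / N) + 2 * (k + 2), for a model with k predictors."""
    return n_obs * math.log(sse / n_obs) + 2 * (n_predictors + 2)


# Hypothetical sums of squared errors for two nested models on 49 observations:
# the fifth predictor lowers the SSE only slightly.
aic_four = aic(n_obs=49, sse=1000.0, n_predictors=4)
aic_five = aic(n_obs=49, sse=995.0, n_predictors=5)

print(aic_four < aic_five)  # True: the small gain in fit does not pay for the penalty
```

The example shows the trade-off behind the selection: an extra predictor is kept only if it reduces the error by more than the penalty it adds.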



As calculated automatically by R, the model with the lowest AIC contains four predictors: Age, Income, Black and Price. The values of the intercept (β0) and the corresponding coefficients (βi) are generated. From now on, this regression model is referred to as fitAIC.

Our optimal model can now be written as follows:

Sales = 31.5782 + 4.08678(Age) + 0.01701(Income) + 0.55917(Black) - 2.4969(Price)

The figures above suggest that Sales is positively correlated with Age, Income and the percentage of black people in each state, and negatively correlated with the retail price of the product. In particular, holding the other predictors fixed:

The intercept value is 31.5782; on its own it has no meaningful interpretation, but it is an important part of the model;

A 1-year increase in median age corresponds to an average increase of 4.08678 in Sales;

A 1 dollar increase in income corresponds to an average increase of 0.01701 in Sales;

A 1 percentage point increase in the share of black people corresponds to an average increase of 0.55917 in Sales;

On the other hand, a 1 cent rise in Price corresponds to an average fall of 2.4969 in Sales.
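The interpretation of each coefficient as the change in Sales for a one-unit change in one predictor can be checked with a short sketch. The coefficients are those of the fitted equation; the function name and the baseline state are hypothetical.

```python
# Coefficients of the fitted model (Price is measured in cents).
COEF = {
    "Intercept": 31.5782,
    "Age": 4.08678,
    "Income": 0.01701,
    "Black": 0.55917,
    "Price": -2.4969,
}


def predict_sales(age, income, black, price):
    """Predicted packs per capita from the four-predictor model."""
    return (COEF["Intercept"] + COEF["Age"] * age + COEF["Income"] * income
            + COEF["Black"] * black + COEF["Price"] * price)


# Hypothetical baseline state; only Price changes between the two calls.
base = predict_sales(age=27, income=4000, black=10, price=40)
higher_price = predict_sales(age=27, income=4000, black=10, price=41)

print(round(higher_price - base, 4))  # -2.4969
```

Because the model is linear, the difference does not depend on the baseline chosen: any state with a price one cent higher is predicted to sell about 2.5 fewer packs per capita.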

MODEL FITTING

Now that we have a forecasting model, we need to check whether all four predictors are useful or whether one or more of them can be dropped. There are 2^4 = 16 candidate models in total, all of which are summarized in the table below (a tick [x] indicates that the variable is included in the model).


The best model, with the lowest CV/AIC and the highest Adj. R², is placed at the top. Again, the model with the four predictors Age, Income, Black and Price is the best option, which confirms the result obtained from the AIC method illustrated above.
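The count of 16 candidate models follows from each of the four predictors being either included or excluded. A minimal sketch that enumerates them in the same tick layout (the helper name is ours, and no model is actually fitted here):

```python
from itertools import combinations

PREDICTORS = ["Age", "Income", "Black", "Price"]


def all_subsets(names):
    """Every subset of the predictors, from the empty model to the full one."""
    subsets = []
    for size in range(len(names) + 1):
        subsets.extend(combinations(names, size))
    return subsets


candidates = all_subsets(PREDICTORS)
print(len(candidates))  # 16

# One row per candidate model: [x] marks an included predictor.
for subset in candidates:
    marks = ["x" if name in subset else " " for name in PREDICTORS]
    print(" ".join(f"[{m}]" for m in marks))
```

In practice each subset would be fitted and ranked by CV, AIC and Adj. R², as in the table.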

RESIDUALS DIAGNOSTIC

Now that we have selected the regression variables (four predictors), we plot the residuals to check that the assumptions of the model are satisfied (the residuals must have zero mean and be uncorrelated with each predictor).


We can see no pattern in these plots, so there is no evidence of a nonlinear relationship.

To be on the safe side, we also plot the residuals against the two predictors previously eliminated (Female and HS). If any pattern arises, we will need to add those two variables back to the model.

Again, no pattern is visible, so no further action is needed in this case.
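A small sketch may clarify what these plots can and cannot show. For a least-squares fit with an intercept, the residuals have zero mean and zero covariance with every included predictor by construction, so the plots are useful for spotting nonlinear patterns, and for checking excluded predictors such as Female and HS. The data below are hypothetical and the fit uses a single predictor for brevity.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


# Hypothetical prices (cents) and sales (packs per capita).
price = [30, 32, 35, 38, 40, 42, 45]
sales = [140, 131, 128, 118, 115, 104, 99]

intercept, slope = fit_line(price, sales)
residuals = [s - (intercept + slope * p) for p, s in zip(price, sales)]

mean_residual = mean(residuals)
cov_with_price = sum((p - mean(price)) * r for p, r in zip(price, residuals))

print(abs(mean_residual) < 1e-9, abs(cov_with_price) < 1e-6)
```

Both checks hold up to floating-point error, whatever the data, which is why a visual search for curvature or funnel shapes is the informative part of the diagnostic.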

The next step is to assess whether heteroscedasticity occurs by plotting the residuals against the fitted values.

There is no systematic pattern here, and the variation in the residuals does not change with the size of the fitted values. Therefore, there is no need to transform our forecast variable Sales.


We continue our analysis by plotting the histogram of the residuals:

In our case, the residuals seem to be slightly positively skewed, although this is probably due to the Washington DC outlier that we chose not to remove.

Now we look at the Q-Q plot:


Clearly, the residuals follow the normal distribution closely.

Of course, this is no guarantee that this particular normal distribution is the best distribution for

this data set. Nonetheless, it is a useful tool to visualize the goodness-of-fit of a data set to a

distribution.
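For readers unfamiliar with the construction, a Q-Q plot pairs each sorted, standardized residual with the normal quantile expected at its rank. The sketch below uses hypothetical residuals and the plotting positions (i - 0.5)/n, which is one common convention and not necessarily the one R applies.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pairs (theoretical normal quantile, standardized sample value)."""
    n = len(values)
    ordered = sorted(values)
    centre, spread = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()
    points = []
    for rank, value in enumerate(ordered, start=1):
        probability = (rank - 0.5) / n  # assumed plotting position
        points.append((standard_normal.inv_cdf(probability),
                       (value - centre) / spread))
    return points


# Hypothetical residuals, not taken from our model.
residuals = [-12.4, -7.9, -5.1, -2.3, -0.8, 1.1, 2.6, 4.9, 7.2, 12.7]

for theoretical, observed in qq_points(residuals):
    print(f"{theoretical:6.2f} {observed:6.2f}")
```

If the two columns are close to each other, the points fall near the diagonal of the plot, which is the visual sign of approximate normality.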

Finally, to assess the accuracy of our model, we pick some values of the predictor variables from both inside and outside the observed range, plug them into the model, and compare the results with the predicted values given by R.

Internal range: Age = 25; Income = 4000; Black = 15; Price = 40

Sales Yin = 31.57820 + 4.08687*25 + 0.01701*4000 + 0.55917*15 - 2.49690*40 ≈ 110.3183

External range: Age = 35; Income = 5500; Black = 75; Price = 50

Sales Yex = 31.57820 + 4.08687*35 + 0.01701*5500 + 0.55917*75 - 2.49690*50 ≈ 185.2894

From the results above we conclude that our model is relatively plausible, since the values of Y computed from the corresponding values of X, from both the internal and the external range, lie within the range predicted by our model.
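The two hand calculations can be reproduced with a short sketch that uses the rounded coefficients of the fitted equation (Black = 0.55917). The function name and the tolerance are ours; the small gaps from the reported values are most likely due to rounding of the coefficients, which is an assumption on our part.

```python
# Rounded coefficients of the fitted model.
INTERCEPT = 31.5782
B_AGE, B_INCOME, B_BLACK, B_PRICE = 4.08678, 0.01701, 0.55917, -2.4969


def predict_sales(age, income, black, price):
    """Predicted Sales from the four-predictor model."""
    return (INTERCEPT + B_AGE * age + B_INCOME * income
            + B_BLACK * black + B_PRICE * price)


y_internal = predict_sales(age=25, income=4000, black=15, price=40)
y_external = predict_sales(age=35, income=5500, black=75, price=50)

# Values reported in the text.
REPORTED_INTERNAL, REPORTED_EXTERNAL = 110.3183, 185.2894

print(round(y_internal, 2), round(y_external, 2))
print(abs(y_internal - REPORTED_INTERNAL) < 0.05,
      abs(y_external - REPORTED_EXTERNAL) < 0.05)
```

Note that Black = 75 lies far outside the observed range, so the external prediction is an extrapolation and should be read with more caution than the internal one.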


PRACTICAL IMPLICATION

Now let us recall our model:

Sales = 31.5782 + 4.08678(Age) + 0.01701(Income) + 0.55917(Black) - 2.4969(Price)

From the figures above we draw the following inferences:

Sales in the 51 States can be predicted from several predictors: the median age of people in each state, per capita income, the percentage of black people in the state, and the average price of a pack of cigarettes. Two other variables (the percentage of females and the percentage of people over 25 who had completed high school) do not seem to have a clear impact on Sales of cigarettes and were thus excluded from the model.

Three variables, Age, Income and Black, are positively correlated with Sales, while Price has a negative effect on this dependent variable. This suggests that the Company could target the corresponding customer segments (senior, wealthy, black) or offer discounts (lower the retail price) in order to boost its Sales.


REFERENCES

Reading

Rob J Hyndman and George Athanasopoulos (August 2014), Forecasting: Principles and

Practice, Print Edition, available at www.otexts.com

Osborne, Jason W. & Amy Overbay (2004), The power of outliers (and why researchers

should always check for them), Practical Assessment, Research & Evaluation, 9(6). Retrieved

December 5, 2014 from http://PAREonline.net/getvn.asp?v=9&n=6

Webpage

www.stackoverflow.com

www.rdatamining.com

www.crossvalidated.com

www.genometoolbox.blogspot.it



APPENDIX
