
Group 8 MAIN 2014-Data Analysis Report


DESCRIPTION

Multiple linear regression


  • Prepared by (student ID number)

    Ana Paula Zavadniak 172345

    Jacopo Gelli 172769

    Marco Delprato 172178

    Phuong Nam Nguyen 172558

    Data Analysis and Forecasting | December 19, 2014

    Department of Economics & Management

    University of Trento

    GROUP 8 - MAIN

    STUDY ON CONSUMPTION PATTERN OF CIGARETTE

  • GROUP 8 |MAIN PAGE 1

    PREVIEW

    The main goal of our project is to forecast cigarette consumption in the 50 States and the District of Columbia (51 observations in total, referred to below as states), based on six predictor variables:

    Age (x1): median age of a person living in a state
    HS (x2): percentage of people over 25 years of age in a state who have completed high school
    Income (x3): per capita personal income for a state (in dollars)
    Black (x4): percentage of Black people living in a state
    Female (x5): percentage of females living in a state
    Price (x6): weighted average price (in cents) of a pack of cigarettes in a state

    The dependent variable (Y) is Sales: the number of packs of cigarettes sold in a state on a per capita basis. The full raw dataset is given in the APPENDIX section at the end of this report.

    Our objective is to construct a multiple regression model of the following form:

    $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$

    [Diagram: Sales (Y) modelled on Age (X1), HS (X2), Income (X3), Black (X4), Female (X5) and Price (X6)]


    DESCRIPTION AND DATA ANALYSIS

    We start by plotting Sales against each of the six predictors (Age, HS, Income, Black, Female, Price). The resulting scatterplot (Figure 1) shows that Female and Black appear skewed. We can also spot some outliers, which may affect the correlations between our variables.

    To get a clearer view of the relationship of Black and Female with Sales, it is advisable to investigate the outliers, apply a logarithmic transformation, and then re-plot the data to see whether there is any improvement.


    Figure 1: Scatterplot of full dataset

    Inspecting the dataset, we identify the following observations as potential outliers: AK, DC and NH.


    In order to assess whether these outliers affect the fit of our model, we first compute the regression with the full dataset (no outliers removed).

    If we instead compute the regression after removing all identified outliers from our dataset, Adjusted R² rises only from 0.2282 to 0.2441, which is too small an improvement.


    We then run the regression without DC only: Adjusted R² falls to 0.1321, even worse than the first scenario. Although DC has outlying values on two variables, we do not think it is reasonable to remove DC completely, since this observation may play a role in our forecasting model.

    Instead, we run the regression without AK and NH: Adjusted R² = 0.4016, which is the best result so far.

    R² describes the strength of the linear relationship between the independent variables Xi and the dependent variable Y; Adjusted R² additionally penalises the number of predictors. A lower Adjusted R² means that our forecasting model fits the data less well, which runs against our main goal. Therefore, we leave only AK and NH out of the forecasting dataset and keep the remaining observations unchanged.
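As a rough illustration of why Adjusted R² is the figure compared across these fits, the sketch below computes it from R², the number of observations n and the number of predictors k, using the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1). The function name and the input values are hypothetical and are not taken from the report's R output.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# Hypothetical example: the same R^2 of 0.30 on 49 observations is
# penalised more heavily when more predictors are used.
for k in (4, 6):
    print(k, "predictors ->", round(adjusted_r2(0.30, 49, k), 4))
```

For a fixed R², the adjusted value falls as predictors are added, which is why it is a fairer yardstick than plain R² when comparing models of different sizes.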

    This is what the scatter plot looks like after omitting the two outliers AK and NH.

    Figure 2: Scatter plot of adjusted data

    Black still seems to be skewed. To see what is going on inside the bulk of the data, we now take the logarithm of the predictor Black and re-fit the model of Sales on it together with the remaining predictors.


    Surprisingly, after the transformation the corresponding Adjusted R² drops while the p-value increases, so we decide not to transform the data, in order to preserve the model's reliability.

    CHOICE OF THE MODEL

    With six predictors, it is a good idea to work out which are the most suitable ones to include in the forecasting model and which should be left out. There are several selection criteria, based on the value of Adjusted R², CV, AIC, BIC and so on. Generally, the model with the minimum AIC is likely to be the best model for forecasting, since minimising AIC is equivalent to minimising CV as the number of observations gets large. We choose this method, using the step command and letting R calculate the AIC.
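The idea behind AIC-based selection can be sketched in Python as a simple backward elimination. This is only an illustration under stated assumptions, not R's step implementation: the data are synthetic, the names x0, x1, x2 and all function names are hypothetical, and the AIC is computed as N log(SSE/N) + 2(k + 2), the form used for regression in Hyndman and Athanasopoulos.

```python
import math
import random


def ols_rss(rows, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
               for x, yi in zip(X, y))


def aic(rows, y, cols):
    """AIC = N log(SSE/N) + 2(k + 2) for the model using columns `cols`."""
    n = len(y)
    rss = ols_rss([[r[c] for c in cols] for r in rows], y)
    return n * math.log(rss / n) + 2 * (len(cols) + 2)


def backward_select(rows, y, names):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    current = list(range(len(names)))
    best = aic(rows, y, current)
    while current:
        trials = [(aic(rows, y, [c for c in current if c != d]), d)
                  for d in current]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break
        best = trial_aic
        current.remove(drop)
    return [names[c] for c in current], best


# Synthetic data: y depends on x0 and x2 only; x1 is pure noise.
random.seed(0)
rows = [[random.gauss(0, 1) for _ in range(3)] for _ in range(60)]
y = [5 + 3 * r[0] - 2 * r[2] + random.gauss(0, 1) for r in rows]
selected, best_aic = backward_select(rows, y, ["x0", "x1", "x2"])
print(selected, round(best_aic, 2))
```

The search stops as soon as no single removal lowers the AIC, so the result is the best model found along that path, not necessarily the best of all possible subsets.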



    As calculated by R, the model with the lowest AIC contains four predictors: Age, Income, Black and Price. The values of the intercept (β0) and the corresponding coefficients (βi) are generated along with it. From now on this regression model will be referred to as fitAIC.

    Our optimal model can now be written as follows:

    Sales = 31.5782 + 4.08678(Age) + 0.01701(Income) + 0.55917(Black) - 2.4969(Price)

    The figures above suggest that Sales is positively associated with Age, Income and the percentage of Black people in each state, and negatively associated with the retail price of the product. In particular, holding the other predictors constant:

    The intercept value is 31.5782; on its own it has no meaningful interpretation, but it is an important part of the model;
    A 1-year increase in median age corresponds to an average increase of 4.08678 in Sales;
    A 1-dollar increase in income corresponds to an average increase of 0.01701 in Sales;
    A 1-percentage-point increase in the share of Black people corresponds to an average increase of 0.55917 in Sales;
    On the other hand, a 1-cent rise in Price (Price is measured in cents) corresponds to an average fall of 2.4969 in Sales.

    MODEL FITTING

    Now that we have a forecasting model, we need to check whether all four predictors are useful or whether one or more of them can be dropped. There are 2^4 = 16 possible models in total, all of which are summarized in the table below (a mark [x] indicates that the variable is included in the model).
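The count of 16 candidate models can be checked with a short enumeration. This is a minimal sketch that only lists the subsets; it does not fit them, and the variable names in the code are illustrative.

```python
from itertools import combinations

predictors = ["Age", "Income", "Black", "Price"]

# Every subset of the four predictors, from the intercept-only model
# (empty tuple) up to the full model: 2**4 = 16 candidates in total.
candidate_models = [
    subset
    for size in range(len(predictors) + 1)
    for subset in combinations(predictors, size)
]

for subset in candidate_models:
    # One row per model, with an [x] for each included predictor.
    marks = ["[x]" if name in subset else "[ ]" for name in predictors]
    print(" ".join(marks), "+".join(subset) or "(intercept only)")

print(len(candidate_models), "models")
```

Each subset would then be fitted and scored on CV, AIC and Adjusted R² to produce a ranking like the one in the table.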


    The best model, with the lowest CV and AIC and the highest Adjusted R², is placed on top. Again, the model with the four predictors Age, Income, Black and Price is the best option, which confirms the result of the AIC method illustrated above.

    RESIDUAL DIAGNOSTICS

    Now that we have selected the regression variables (four predictors), we plot the residuals to check that the assumptions of the model are satisfied: the residuals must have zero mean and be uncorrelated with each predictor.
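One point worth keeping in mind when reading such plots: for a least-squares fit with an intercept, the residuals have zero mean and zero sample covariance with every included predictor by construction, so the plots are mainly useful for spotting nonlinear patterns. The sketch below illustrates this with a single hypothetical predictor and synthetic data; none of the numbers come from the report.

```python
import random
from statistics import mean

# Synthetic data: one hypothetical predictor x and a noisy linear response y.
random.seed(1)
x = [random.uniform(20, 35) for _ in range(49)]
y = [30 + 4 * xi + random.gauss(0, 10) for xi in x]

# Closed-form simple linear regression (least squares with intercept).
xbar, ybar = mean(x), mean(y)
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Both quantities are zero up to floating-point rounding.
res_mean = mean(residuals)
res_x_cov = sum((xi - xbar) * ri for xi, ri in zip(x, residuals)) / (len(x) - 1)
print(f"mean of residuals: {res_mean:.2e}")
print(f"covariance with x: {res_x_cov:.2e}")
```

Because these two checks pass automatically for included predictors, a curved or funnel-shaped pattern in the residual plot is the more informative warning sign.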


    We can see that there is no pattern, so there is no evidence of a nonlinear relationship.

    To be on the safe side, we also plot the residuals against the two predictors previously eliminated (Female and HS). If any pattern arises, we need to add those two variables back to the model.

    Again, no pattern is visible, so no further action is needed in this case.

    The next step is to assess whether heteroscedasticity is present by plotting the residuals against the fitted values.

    There is no systematic pattern here, and the variation in the residuals does not change with the size of the fitted values. Therefore, there is no need to transform our forecast variable Sales.


    We continue our analysis by plotting the histogram of the residuals:

    In our case, the residuals seem to be slightly positively skewed, although this is probably due to the Washington DC outlier that we chose not to remove.

    Now we look at the Q-Q plot:


    The residuals follow the normal distribution closely.

    Of course, this is no guarantee that the normal distribution is the best distribution for this dataset. Nonetheless, the Q-Q plot is a useful tool for visualizing how well a dataset fits a distribution.

    Finally, in order to assess the accuracy of our model, we pick some arbitrary values of the predictor variables from both inside and outside the observed range, plug them into the model, and compare the results with the predictions reported by R.

    Internal range: Age = 25; Income = 4000; Black = 15; Price = 40

    Sales Yin = 31.57820 + 4.08687*25 + 0.01701*4000 + 0.55917*15 - 2.49690*40 = 110.3183

    External range: Age = 35; Income = 5500; Black = 75; Price = 50

    Sales Yex = 31.57820 + 4.08687*35 + 0.01701*5500 + 0.55917*75 - 2.49690*50 = 185.2894

    From the results above we conclude that our model is reasonably plausible, since the values of Y computed from the corresponding values of X, both inside and outside the observed range, lie within the range predicted by our model.
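The plug-in calculation can be reproduced with a few lines of Python. This is a sketch that uses the rounded coefficients printed in the model equation of the CHOICE OF THE MODEL section, with the Price coefficient taken as negative as the text states; the function and dictionary names are hypothetical. Because the coefficients are rounded, the results can differ from R's own predictions in the second decimal place.

```python
# Rounded coefficients as printed in the report's model equation.
COEFFICIENTS = {
    "Intercept": 31.5782,
    "Age": 4.08678,
    "Income": 0.01701,
    "Black": 0.55917,
    "Price": -2.4969,  # negative: higher price, lower sales
}


def predict_sales(age: float, income: float, black: float, price: float) -> float:
    """Evaluate the fitted linear model for one set of predictor values."""
    return (COEFFICIENTS["Intercept"]
            + COEFFICIENTS["Age"] * age
            + COEFFICIENTS["Income"] * income
            + COEFFICIENTS["Black"] * black
            + COEFFICIENTS["Price"] * price)


# The two scenarios used in the report (Price in cents).
inside = predict_sales(age=25, income=4000, black=15, price=40)
outside = predict_sales(age=35, income=5500, black=75, price=50)
print(f"inside observed range:  {inside:.2f}")
print(f"outside observed range: {outside:.2f}")
```

Predictions for values outside the observed range, such as Black = 75, are extrapolations and carry more uncertainty than those inside it.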


    PRACTICAL IMPLICATIONS

    Let us now recall our model:

    Sales = 31.5782 + 4.08678(Age) + 0.01701(Income) + 0.55917(Black) - 2.4969(Price)

    From the figures above we draw the following inferences:

    Sales in each state can be predicted from several predictors: the median age of people in the state, per capita income, the percentage of Black people in the population, and the average price of cigarettes (per pack). Two other variables (the percentage of females and the percentage of people over 25 who have completed high school) do not seem to have any clear impact on Sales of cigarettes and were therefore excluded from the model.

    The three variables Age, Income and Black are positively associated with Sales, while Price has a negative effect on this dependent variable. This suggests that the Company could target the corresponding customer segments (older, wealthier, Black) or offer discounts (lower the retail price) in order to boost its Sales.


    REFERENCES

    Reading

    Hyndman, Rob J. & George Athanasopoulos (August 2014), Forecasting: Principles and Practice, Print Edition, available at www.otexts.com

    Osborne, Jason W. & Amy Overbay (2004), The power of outliers (and why researchers should always check for them), Practical Assessment, Research & Evaluation, 9(6). Retrieved December 5, 2014 from http://PAREonline.net/getvn.asp?v=9&n=6

    Webpage

    www.stackoverflow.com

    www.rdatamining.com

    www.crossvalidated.com

    www.genometoolbox.blogspot.it



    APPENDIX