
Chintan Gandhi Capstone Project


Project: Predictive Analytics Capstone

Task 1: Determine Store Formats for Existing Stores

• What is the optimal number of store formats? How did you arrive at that number?

Figure 1. Adjusted Rand Indices and Calinski-Harabasz Indices


Figure 2. Adjusted Rand Indices and Calinski-Harabasz Indices

Based on the adjusted Rand (AR) and Calinski-Harabasz (CH) indices, the data can be segmented into 3, 4 or 5 clusters, since these solutions have the highest median values.


Figure 3. IQR, median, maximum and minimum values for 3, 4 and 5 clusters

From the values in Figure 3, the median is highest for 3 clusters. The interquartile ranges for 4 and 5 clusters are narrower than for 3 clusters, but the differences between the medians are much larger than the differences between the interquartile ranges. Hence, the cluster model to be developed will contain 3 clusters.
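The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the index values are hypothetical (the figures in this report are not reproduced here), and the helper names `summarize` and `pick_k` are made up for the example. The rule it applies is the one used in the text: prefer the highest median, and use the narrower interquartile range only to break ties.

```python
from statistics import median, quantiles

def summarize(values):
    """Return (median, IQR) for index values from repeated clustering runs."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1

def pick_k(index_by_k):
    """Pick the k with the highest median; a smaller IQR breaks ties."""
    summary = {k: summarize(v) for k, v in index_by_k.items()}
    best = max(summary, key=lambda k: (summary[k][0], -summary[k][1]))
    return best, summary

# Hypothetical adjusted Rand values from replicate runs (not the report's data).
ari_by_k = {
    3: [0.42, 0.55, 0.61, 0.48, 0.70],
    4: [0.40, 0.45, 0.47, 0.43, 0.50],
    5: [0.35, 0.41, 0.39, 0.44, 0.38],
}

best_k, summary = pick_k(ari_by_k)
for k, (med, iqr) in sorted(summary.items()):
    print(f"k={k}: median={med:.2f}, IQR={iqr:.2f}")
print("chosen k:", best_k)
```

In this made-up data, k = 3 has the widest spread but a clearly higher median, which mirrors the trade-off discussed for Figure 3.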

• How many stores fall into each store format?

Figure 4. Cluster information obtained from the K-Means clustering model

Figure 4 shows the cluster information from the K-means clustering model. The size of each cluster is the number of stores that fall into that segment.

• Based on the results of the clustering model, what is one way that the clusters differ from one another?

Figure 5. Variance seen in each category in the clusters

From Figure 5, the value for the Dairy category is 0.70 for cluster 2, -0.76 for cluster 1 and -0.01 for cluster 3. Because these values are distinctly apart, the three clusters differ in the sales made in the Dairy category. Cluster 2 and cluster 1 are the most different, and cluster 3 is intermediate.
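The separation between clusters can be made explicit by computing the gap between each pair of Dairy values. The sketch below uses the three values quoted from Figure 5; it assumes they are standardized cluster centroids, and the helper name `pairwise_gaps` is hypothetical.

```python
from itertools import combinations

# Dairy values quoted from Figure 5, keyed by cluster number.
# Assumption: these are standardized cluster centroids.
dairy_centroids = {1: -0.76, 2: 0.70, 3: -0.01}

def pairwise_gaps(centroids):
    """Absolute difference between every pair of cluster centroids."""
    return {
        (a, b): round(abs(centroids[a] - centroids[b]), 2)
        for a, b in combinations(sorted(centroids), 2)
    }

gaps = pairwise_gaps(dairy_centroids)
widest_pair = max(gaps, key=gaps.get)

for pair, gap in gaps.items():
    print(f"clusters {pair}: gap = {gap}")
print("most different pair:", widest_pair)
```

The largest gap is between clusters 1 and 2, and cluster 3 sits roughly midway between them.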


• Please provide a Tableau visualization (saved as a Tableau Public file) that shows the location of the stores, uses color to show cluster, and size to show total sales.

Figure 6. Cluster map of store segments


Task 2: Formats for New Stores

• What methodology did you use to predict the best store format for the new stores? Why did you choose that methodology? (Remember to use a 20% validation sample with Random Seed = 3 to test differences in models.)

Figure 7. Decision Tree, Forest Model and Boosted Model comparison

The Boosted Model gives an accuracy of 82.35%. Its PPV is 80% for cluster 1, 67% for cluster 2 and 100% for cluster 3. Its F1 score is 88.89%.

The Decision Tree Model gives an accuracy of 70.59%. Its PPV is 60% for cluster 1, 67% for cluster 2 and 83.3% for cluster 3. Its F1 score is 76.85%.

The Forest Model gives an accuracy of 82.35%. Its PPV is 75% for cluster 1, 80% for cluster 2 and 87.5% for cluster 3. Its F1 score is 82.35%.

The Boosted Model will be used for classification. Its accuracy matches that of the Forest Model and exceeds that of the Decision Tree Model, and it has the highest PPV for cluster 1 and cluster 3 of the three models. Lastly, it also has the highest F1 score.
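For reference, accuracy, PPV and F1 can all be derived from a confusion matrix on the validation sample. The sketch below shows the standard definitions on a hypothetical 3-class matrix; the counts are invented for illustration and are not the matrices behind Figure 7, and the function name is an assumption. Averaging the per-class F1 values (macro F1) is one common way to get a single score and may differ from how the comparison tool aggregates it.

```python
def classification_metrics(confusion):
    """Compute accuracy and per-class PPV, recall and F1.

    confusion[i][j] is the number of stores whose actual class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total

    ppv, recall, f1 = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column total
        actual = sum(confusion[c])                          # row total
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        ppv.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return accuracy, ppv, recall, f1

# Hypothetical validation results for 17 stores (rows: actual, columns: predicted).
confusion = [
    [3, 1, 0],
    [1, 4, 1],
    [0, 1, 6],
]

accuracy, ppv, recall, f1 = classification_metrics(confusion)
macro_f1 = sum(f1) / len(f1)
print(f"accuracy = {accuracy:.2%}")
print("PPV per cluster:", [f"{p:.1%}" for p in ppv])
print(f"macro F1 = {macro_f1:.2%}")
```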


• What format do each of the 10 new stores fall into? Please fill in the table below.

Store Number    Segment
S0086           3
S0087           2
S0088           1
S0089           2
S0090           2
S0091           1
S0092           2
S0093           1
S0094           2
S0095           2

Figure 8. Segments identified for new stores


Task 3: Predicting Produce Sales

1. What type of ETS or ARIMA model did you use for each forecast? Use ETS(a,m,n) or ARIMA(ar, i, ma) notation. How did you come to that decision?

Figure 9. Time Series decomposition plot

The decomposition plot in Figure 9 is used to determine the ETS model. In the seasonal plot, the seasonality changes slightly in magnitude every year, although the change is hard to see in the graph. To account for this, a multiplicative seasonal method will be used. The trend line is neither linear nor quadratic, so no trend method will be used. The remainder plot shows errors of varying magnitude, so a multiplicative error method will be used. The ETS model will therefore be an MNM model, ETS(M,N,M).


Figure 9. ACF and PACF plots from Time Series decomposition

The ACF plot in Figure 9 shows that the series is not stationary and has a seasonal pattern. The time series will be differenced to obtain a stationary time series.


Figure 10. ACF and PACF plot for seasonal difference

The ACF plot in Figure 10 shows that the seasonally differenced series is still not stationary, so it is differenced further by taking a seasonal first difference.


Figure 11. ACF and PACF plot for seasonal first difference

After the seasonal first difference shown in Figure 11, the series is stationary. For the non-seasonal component, the ACF plot shows a negative correlation at lag 1 that then cuts off to zero, while the PACF plot drops gradually to zero. This indicates an MA(1) term, so q = 1. Since a first difference was taken, d = 1. For the seasonal component, the correlation at lag 12 is negative and significant in both the ACF and the PACF, and at lag 24 it cuts off to zero in both. This indicates a seasonal MA(1) term, so Q = 1. Since a seasonal difference was taken, D = 1. The value of m is 12 because the seasonal period is 12 months. The ARIMA model is ARIMA(0,1,1)(0,1,1)[12].
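The two differencing steps (D = 1 with m = 12, then d = 1) and the autocorrelation function that the plots are built from can be sketched without any libraries. The series below is hypothetical, a linear trend plus a fixed 12-month pattern, chosen so the effect of each step is easy to see; it is not the produce sales data, and the function names are assumptions for the example.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

# Hypothetical monthly series: linear trend plus a repeating 12-month pattern.
pattern = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3]
series = [100 + 2 * t + pattern[t % 12] for t in range(48)]

rho = acf(series, 12)                          # slow decay: not stationary
seasonal_diff = difference(series, lag=12)     # D = 1, m = 12 removes the pattern
stationary = difference(seasonal_diff, lag=1)  # d = 1 removes the remaining level

print("lag-1 autocorrelation of raw series:", round(rho[1], 2))
print("seasonal difference (first values):", seasonal_diff[:3])
print("after first difference (first values):", stationary[:3])
```

With real data the differenced series still contains noise, and it is the ACF and PACF of that series that suggest the MA terms.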


Figure 12. ETS MNM model errors

Figure 13. ARIMA(0,1,1)(0,1,1)[12] model errors

Looking at Figure 12 and Figure 13, the RMSE and MASE of the ARIMA model are smaller than those of the ETS MNM model.

Figure 14. Errors from both models from TS comparison

From Figure 14, the ETS model gives smaller errors than the ARIMA model when used to predict the holdout sample. The RMSE and MAPE values are lower for the ETS MNM model than for the ARIMA model.

Based on this analysis, the ETS MNM model will be used to forecast future values.
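The error measures used in this comparison can be computed directly from the holdout actuals and each model's forecasts. The sketch below uses the usual definitions; all numbers are hypothetical, the labels `model_a` and `model_b` are placeholders rather than the ETS and ARIMA results in Figure 14, and the MASE shown is scaled by a seasonal naive forecast on the training data, which is an assumption about how the tool defines it.

```python
from math import sqrt

def rmse(actual, forecast):
    """Root mean squared error."""
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def mase(actual, forecast, training, m=12):
    """MAE scaled by the in-sample MAE of a seasonal naive forecast (period m)."""
    scale = sum(abs(training[i] - training[i - m])
                for i in range(m, len(training))) / (len(training) - m)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

# Hypothetical data: 24 training months and a 4-month holdout sample.
training = [100 + (i % 12) + 10 * (i // 12) for i in range(24)]
actual = [100, 200, 400, 500]
forecasts = {
    "model_a": [110, 190, 380, 520],
    "model_b": [105, 195, 390, 510],
}

for name, fc in forecasts.items():
    print(name, round(rmse(actual, fc), 2), round(mape(actual, fc), 2),
          round(mase(actual, fc, training), 2))

best_model = min(forecasts, key=lambda name: rmse(actual, forecasts[name]))
print("lower holdout RMSE:", best_model)
```

Choosing on holdout error rather than in-sample error is the same reasoning used above to prefer the ETS model.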


2. Please provide a table of your forecasts for existing and new stores. Also, provide visualization of your forecasts that includes historical data, existing stores forecasts, and new stores forecasts.

Figure 15. Forecasted Produce Sales for Existing and New Stores

Figure 16. Historical and forecasted produce sales