
Customer Segmentation & Opportunity Analysis

(using R programming language)

A Project work as per the Dissertation (MCS491) paper of final semester to obtain the degree of

Master of Science (M.Sc.) in Computer Science

Submitted by

Mr. Abhijit Bag

(Reg. No: 161541810001 & Roll No: 15499016029)

On 7th May, 2018

under supervision of

Mr. Subhajit Adhikari

Dinabandhu Andrews Institute of Technology & Management (Maulana Abul Kalam Azad University of Technology)

Customer Segmentation & Opportunity Analysis 2018


Declaration Of Originality And Compliance Of Academic Ethics

I hereby declare that this thesis contains original research work done by me as part of my master's studies in computer science. All information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all the materials.

Name: Abhijit Bag
Roll No: 15499016029
Reg. No: 161541810001
Project Title: “Customer Segmentation & Opportunity Analysis”

…………………………………….
Signature & Date



Acknowledgement

I would like to express my sincere, heartfelt gratitude to my respected guide, Assistant Professor Mr. Subhajit Adhikari, Department of Computer Science, DAITM under MAKAUT, for his unfailing guidance, prolific encouragement, constructive suggestions and continuous involvement during each and every phase of this work. I am also thankful to Dr. Sanjukta Nandy, Principal, DAITM, to Ms. Paramita Roy, HOD, Department of Computer Science, and to all other faculty members and staff for providing me with all the facilities and support needed to complete these activities. I would like to express my gratitude to my family and parents for their belief, mental support and guidance. Last but not least, I would like to thank all my classmates of the MSc (CS) batch 2016-18 for their friendly co-operation and support.



Certificate of Approval

This is to certify that the work entitled “Customer Segmentation & Opportunity Analysis” has been satisfactorily completed by Abhijit Bag (Roll No-15499016029; Reg. No-161541810001). It is a bona fide work carried out under my supervision at DAITM, Kolkata for fulfilment of MSc in Computer Science during the academic year 2016-2018. It is understood that by this approval the undersigned do not necessarily endorse or approve any statement made, opinion expressed or conclusion drawn therein, but approve this project only for the purpose for which it has been submitted.

……………………………………………. Signature of examiner Date:



To Whom It May Concern

This is to certify that the work entitled “Customer Segmentation & Opportunity Analysis” has been satisfactorily completed by Mr. Abhijit Bag (Roll No-15499016029, Reg. No-161541810001). It is a bona fide work carried out under my supervision at DAITM, Kolkata for partial fulfilment of MSc in Computer Science during the academic year 2017-18.

…………………………………………………….. Project Guide Mr. Subhajit Adhikari Assistant Professor, Dinabandhu Andrews Institute Of Technology & Management, Kolkata

…………………………………………………….. Forwarded by Ms. Paramita Ray HOD, Dept. of Computer Science, Dinabandhu Andrews Institute Of Technology & Management Kolkata.



Customer Segmentation & Why it Matters

At its most basic, customer segmentation (also known as market segmentation) is the division of potential customers in a given market into discrete groups. That division is based on variables and descriptors of those customers having similar enough:

1. Needs, i.e., so that a single whole product can satisfy them.

2. Buying characteristics, i.e., responses to messaging, marketing channels, and sales channels, so that a single go-to-market approach can be used to sell to them competitively and economically.

More details are mentioned below:



There are three main approaches to market segmentation:

A priori segmentation, the simplest approach, uses a classification scheme based on publicly available characteristics — such as industry and company size — to create distinct groups of customers within a market. However, a priori market segmentation may not always be valid, since companies in the same industry and of the same size may have very different needs.

Needs-based segmentation is based on differentiated, validated drivers (needs) that customers express for a specific product or service being offered. The needs are discovered and verified through primary market research, and segments are demarcated based on those different needs rather than characteristics such as industry or company size.

Value-based segmentation differentiates customers by their economic value, grouping customers with the same value level into individual segments that can be distinctly targeted.

Benefits of Customer Segmentation

At the expansion stage, executing a marketing strategy without any knowledge of how your target market is segmented is akin to firing shots at a target 100 feet away while blindfolded. The likelihood of hitting the target is a matter of luck more than anything else.

Without a deep understanding of how a company’s best current customers are segmented, a business often lacks the market focus needed to allocate and spend its precious human and capital resources efficiently. Furthermore, a lack of best current customer segment focus can cause diffused go-to-market and product development strategies that hamper a company’s ability to fully engage with its target segments. Together, all of those factors can ultimately impede a company’s growth.



If best current customer segmentation is done right, however, the business benefits are numerous. For example, a best current customer segmentation exercise can tangibly impact your operating results by:

1. Improving your whole product: Having a clear idea of who wants to buy your product and what they need it for will help you differentiate your company as the best solution for their individual needs. The result will be increased satisfaction and better performance against competitors. The benefits also extend beyond your core product offering, since any insights into your best customers will allow your organization to offer better customer support, professional services, and any other offerings that make up their whole product experience.

2. Focusing your marketing message: In parallel with improvements to the product, conducting a customer segmentation project can help you develop more focused marketing messages that are customized to each of your best segments, resulting in higher quality inbound interest in your product.

3. Allowing your sales organization to pursue higher percentage opportunities: By spending less time on less lucrative opportunities and more on your most successful segments, your sales team will be able to increase its win rate, cover more ground, and ultimately increase revenues.

4. Getting higher quality revenues: Not all revenue dollars are created equal. Sales into the wrong segment can be more expensive to sell and maintain, and may have a higher churn rate or lower upsell potential after the initial purchase has been made. Staying away from these types of customers and focusing on better ones will increase your margins and promote the stability of your customer base.

Conducting best current customer segmentation research can have numerous other ancillary benefits, of course, but this guide will focus primarily on how it can impact the four cited above. The bottom line is that if you are able to sell more of your product to your most profitable customers, then you will be able to scale the business more efficiently and ensure that everything you do — from lead generation to new product development — revolves around the right things.



Customer Segmentation Using Cluster Analysis

In brief, cluster analysis uses a mathematical model to discover groups of similar customers based on finding the smallest variations among customers within each group. The process is not based on any predetermined thresholds or rules (as are most simple segmentation methods), but rather the data itself generates the customer prototypes that inherently exist within the population of customers. The two main advantages of cluster analysis over simple threshold/rule-based segmentation are:

practicality – it would be practically impossible to use predetermined rules to segment customers over many dimensions, and

homogeneity – variances within each resulting group are very small in cluster analysis, whereas rule-based segmentation typically groups customers who are actually very different from one another.

The customer segmentation process can be performed with various clustering algorithms. We focus on k-means clustering in R. While the algorithm is quite simple to implement, half the battle is getting the data into the correct format and interpreting the results. We go over formatting the order data, running the kmeans() function to cluster the data for several candidate values of k, using silhouette() from the cluster package to determine the optimal number of clusters k, and interpreting the results by inspection of the k-means centroids.

How K-Means Algorithm Works

The k-means clustering algorithm works by finding like groups based on Euclidean distance, a measure of distance or similarity. The practitioner selects the number of groups k to cluster, and the algorithm finds the best centroids for the k groups. The practitioner can then use those groups to determine which factors the group members have in common. For customers, these would be their buying preferences.



K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize the total intra-cluster variance, also called the squared error function.
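In its commonly used form, the squared error function can be written as below. The notation is an assumption introduced here for illustration and is not taken from the original text: x_i^(j) denotes the i-th object assigned to cluster j, c_j the centroid (mean) of cluster j, and n_j the number of objects in cluster j.

```latex
J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\lVert x_i^{(j)} - c_j \right\rVert^{2}
```

A smaller J means that objects lie closer to their own cluster centroid.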

Algorithm

1. Choose the number of groups k in advance.

2. Select k points at random as the initial cluster centres.

3. Assign each object to its closest cluster centre according to the Euclidean distance function.

4. Calculate the centroid, or mean, of all objects in each cluster and use it as the new cluster centre.

5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
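The steps above can be sketched in a few lines of base R. This is only an illustrative sketch, not the implementation used in the analysis (which relies on the built-in kmeans() function); the function name simple_kmeans and the toy data are assumptions made for the example.

```r
# Illustrative k-means loop (hypothetical helper; use stats::kmeans() in practice).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Select k distinct rows at random as the initial cluster centres.
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every row to every centre (n x k matrix).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
    # Assign each row to its closest centre.
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop when the assignments are the same in consecutive rounds.
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Recompute each centre as the mean of its members.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centres[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centres)
}

# Toy data: two well-separated groups of ten points each.
set.seed(42)
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the initial centres are random, different seeds can lead to different local optima on less cleanly separated data, which is why kmeans() offers an nstart argument for multiple restarts.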

K-Means is a relatively efficient method. However, we need to specify the number of clusters in advance, and the final results are sensitive to initialization: the algorithm often terminates at a local optimum. Unfortunately, there is no global theoretical method to find the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting.

Getting Started With Data

To start, we’ll need some orders to evaluate. If you’d like to follow along, we will be using the bikes data set, which has already been retrieved. We’ll load the data first using the xlsx package for reading Excel files.



Next, we’ll get the data into a usable format, typical of an SQL query from an ERP database. The following code merges the customers, products and orders data frames using the dplyr package.
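As a rough sketch of this merge step, the example below joins three miniature tables. Everything in it is an assumption made for illustration: the column names (customer.id, product.id, bikeshop.name and so on) and the values are hypothetical, and base R's merge() is used as a stand-in for the dplyr joins so that the example runs without extra packages.

```r
# Hypothetical miniature versions of the three tables (names and values are assumptions).
customers <- data.frame(customer.id = 1:2,
                        bikeshop.name = c("Shop A", "Shop B"))
products <- data.frame(product.id = 1:2,
                       model = c("Model X", "Model Y"),
                       category1 = c("Mountain", "Road"),
                       price = c(6000, 1500))
orders <- data.frame(order.id = 1:3,
                     customer.id = c(1, 1, 2),
                     product.id = c(1, 2, 2),
                     quantity = c(2, 1, 5))

# Attach customer and product details to every order line.
# With dplyr, the same joins could be written with left_join(..., by = "customer.id") etc.
orders_ext <- merge(orders, customers, by = "customer.id")
orders_ext <- merge(orders_ext, products, by = "product.id")

# Extended price = unit price x quantity for each order line.
orders_ext$price.extended <- orders_ext$price * orders_ext$quantity
orders_ext
```

The result has one row per order line with the customer name, bike model, category, unit price and quantity side by side, which is the shape the later steps build on.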

Developing A Hypothesis For Customer Trends

Developing a hypothesis is necessary, as the hypothesis will guide our decisions on how to formulate the data in such a way as to cluster customers. For the Cannondale orders, our hypothesis is that bike shops purchase Cannondale bike models based on features such as Mountain or Road Bikes and price tier (high/premium or low/affordable). Although we will use bike model to cluster on, the bike model features (e.g. price, category, etc.) will be used for assessing the preferences of the customer clusters (more on this later).

To start, we’ll need a unit of measure to cluster on. We can select quantity purchased or total value of purchases. We’ll select quantity purchased because total value can be skewed by the bike unit price. For example, a premium bike can be sold for 10X more than an affordable bike, which can mask the quantity buying habits.

https://en.wikipedia.org/wiki/Enterprise_resource_planning



Manipulating The Data Frame

Next, we need a data manipulation plan of attack to implement clustering on our data. We’ll use our hypothesis to guide us. First, we’ll need to get the data frame into a format conducive to clustering bike models to customer IDs. Second, we’ll need to convert price into a categorical variable representing high/premium and low/affordable. Last, we’ll need to scale the bike model quantities purchased by customer so the k-means algorithm weights the purchases of each customer evenly.

We’ll tackle formatting the data frame for clustering first. We need to spread the customers by quantity of bike models purchased.
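A minimal sketch of this reshaping step is shown below. The data, the column names and the use of base R's xtabs() are assumptions made for illustration; the original analysis may have used a different function to spread the data.

```r
# Hypothetical order lines: one row per purchase (names and values are assumptions).
orders_ext <- data.frame(
  bikeshop.name = c("ShopA", "ShopA", "ShopB", "ShopB", "ShopB"),
  model = c("Model X", "Model Y", "Model Y", "Model Z", "Model Y"),
  quantity = c(2, 1, 5, 3, 1)
)

# One row per bike model, one column per customer, cells = total quantity purchased.
customer_trends <- xtabs(quantity ~ model + bikeshop.name, data = orders_ext)
customer_trends <- as.data.frame.matrix(customer_trends)
customer_trends
```

Models that a customer never bought appear as zeros, so every customer ends up with a value for every model.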

Next, we need to convert the unit price to categorical high/low variables. One way to do this is with the cut2() function from the Hmisc package. We’ll segment the price into high/low by median price. Selecting g = 2 divides the unit prices into two halves using the median as the split point.
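The median split can be illustrated without extra packages. The sketch below is a base R stand-in for the cut2() call, using hypothetical prices; the labels produced by cut2() itself are interval labels rather than "high"/"low", and its handling of values exactly at the median may differ.

```r
# Hypothetical unit prices (assumed values, for illustration only).
price <- c(500, 1200, 2500, 4000, 6000, 9000)

# Split at the median: prices at or above it are "high", the rest "low".
# With Hmisc loaded, a comparable two-group split is cut2(price, g = 2).
price_bin <- ifelse(price >= median(price), "high", "low")

data.frame(price, price_bin)
```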

Last, we need to scale the quantity data. Unadjusted quantities present a problem to the k-means algorithm. Some customers are larger than others, meaning they purchase higher volumes. Fortunately, we can resolve this issue by converting the customer order quantities to proportions of the total bikes purchased by each customer. The prop.table() matrix function provides a convenient way to do this. An alternative is to use the scale() function, which standardizes the data. However, this is less interpretable than the proportion format.
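The effect of prop.table() can be seen on a small matrix. The quantities and names below are hypothetical; the layout (models as rows, customers as columns) is assumed to match the data frame described above.

```r
# Hypothetical quantities: rows = bike models, columns = customers.
qty <- matrix(c(2, 1, 0,
                0, 6, 3),
              nrow = 3,
              dimnames = list(c("Model X", "Model Y", "Model Z"),
                              c("ShopA", "ShopB")))

# margin = 2 divides each column (customer) by that customer's total,
# so every customer's purchases sum to 1 regardless of volume.
qty_prop <- prop.table(qty, margin = 2)
qty_prop
```

ShopB buys three times as many bikes as ShopA here, yet after conversion both columns sum to 1, so the clustering compares purchase mix rather than purchase volume.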

The final data frame (first five rows shown below) is now ready for clustering.



K-Means Clustering

Now we are ready to perform k-means clustering to segment our customer base. Think of clusters as groups in the customer base. Prior to starting, we will need to choose the number of customer groups, k, that are to be detected. The best way to do this is to think about the customer base and our hypothesis. We believe that there are most likely to be at least four customer groups because of mountain bike vs road bike and premium vs affordable preferences. We also believe there could be more, as some customers may not care about price but may still prefer a specific bike category. However, we’ll limit the clusters to eight, as more is likely to overfit the segments.

Running The K-Means Algorithm on the dataset

The code below does the following:

1. Converts the customerTrends data frame into kmeansDat.t. The model and features are dropped so the customer columns are all that are left. The data frame is transposed to have the customers as rows and models as columns. The kmeans() function requires this format.

2. Runs the kmeans() function to cluster the customer segments. We set minClust = 4 and maxClust = 8. From our hypothesis, we expect there to be at least four and at most six groups of customers. This is because customer preference is expected to vary by price (high/low) and category1 (mountain vs road). There may be other groupings as well. Going beyond eight segments risks overfitting the segments.



3. Uses the silhouette() function to obtain silhouette widths. Silhouette is a technique in clustering that validates the best cluster groups. The silhouette() function from the cluster package allows us to get the average width of silhouettes, which will be used to programmatically determine the optimal number of clusters.
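The three steps above can be sketched as follows. This is a hedged illustration on synthetic data, not the original analysis code: the object names (kmeans_dat, min_clust, max_clust), the synthetic customer profiles and the range of 2 to 6 clusters are all assumptions chosen so that the example is small and self-contained.

```r
library(cluster)  # provides silhouette(); normally shipped with R as a recommended package

set.seed(11)
# Synthetic stand-in for the transposed data: 30 "customers" (rows)
# scattered around three well-separated purchase profiles.
profile_centres <- rbind(c(0, 0), c(6, 0), c(0, 6))
kmeans_dat <- profile_centres[rep(1:3, each = 10), ] +
  matrix(rnorm(60, sd = 0.4), ncol = 2)

min_clust <- 2  # assumed range for this toy example
max_clust <- 6
dist_mat <- dist(kmeans_dat)  # pairwise Euclidean distances, needed by silhouette()

# For each candidate k: run k-means, then average the silhouette widths.
avg_width <- sapply(min_clust:max_clust, function(k) {
  km <- kmeans(kmeans_dat, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist_mat)[, "sil_width"])
})
names(avg_width) <- min_clust:max_clust

# The k with the largest average silhouette width is taken as the best choice.
best_k <- as.integer(names(which.max(avg_width)))
best_k
```

Setting nstart to a value above 1 reruns k-means from several random starts and keeps the best result, which reduces the sensitivity to initialization.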

Next, we plot the average silhouette widths for each choice of k. The best number of clusters is the one with the largest average silhouette width, which turns out to be 5 clusters.

Which customers are in each segment?

Now that we have clustered the data, we can inspect the groups to find out which customers are grouped together. The code below groups the customer names by cluster X1 through X5.
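One way to list customers by cluster is sketched below, under the assumption that the customer names are the row names of the matrix passed to kmeans(). The four shops and their purchase proportions are hypothetical, and only two clusters are used to keep the example small.

```r
set.seed(7)
# Hypothetical purchase proportions: rows = customers, columns = two bike models.
profiles <- rbind(ShopA = c(0.90, 0.10),
                  ShopB = c(0.85, 0.15),
                  ShopC = c(0.10, 0.90),
                  ShopD = c(0.20, 0.80))

km <- kmeans(profiles, centers = 2, nstart = 10)

# km$cluster is a named integer vector (names come from the row names),
# so split() returns one character vector of customer names per cluster.
segments <- split(names(km$cluster), km$cluster)
segments
```

The cluster numbers themselves are arbitrary labels; only the grouping of names is meaningful.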

Determining The Preferences Of The Customer Segments

The easiest way to determine the customer preferences is by inspection of factors related to the model (e.g. price point, category of bike, etc.). Advanced algorithms to classify the groups can be used if there are many factors, but typically this is not necessary as the trends tend to jump out. The code below attaches the k-means centroids to the bike models and categories for trend inspection.

https://en.wikipedia.org/wiki/Silhouette_(clustering)
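Attaching the centroids to the model attributes might look like the sketch below. The data, the attribute columns (category1, price.cat) and the object names are hypothetical; the only library behaviour relied on is that kmeans() returns a centers matrix with one row per cluster and one column per input variable.

```r
set.seed(3)
# Hypothetical proportions: rows = customers, columns = bike models.
kmeans_dat <- rbind(ShopA = c(0.7, 0.2, 0.1),
                    ShopB = c(0.6, 0.3, 0.1),
                    ShopC = c(0.1, 0.2, 0.7),
                    ShopD = c(0.0, 0.3, 0.7))
colnames(kmeans_dat) <- c("Model X", "Model Y", "Model Z")

# Hypothetical model attributes used to interpret the clusters.
model_info <- data.frame(model = colnames(kmeans_dat),
                         category1 = c("Mountain", "Road", "Road"),
                         price.cat = c("high", "low", "high"))

km <- kmeans(kmeans_dat, centers = 2, nstart = 10)

# km$centers is clusters x models; transpose so that each model is a row.
centroids <- as.data.frame(t(km$centers))
names(centroids) <- paste0("X", seq_len(ncol(centroids)))
cluster_centers <- cbind(model_info, centroids)

# Models ranked by their weight in cluster X1, highest first.
cluster_centers[order(-cluster_centers$X1), ]
```

Each centroid column shows the average share of a customer's purchases that goes to each model within that cluster, so sorting by a column surfaces the models that cluster favours.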

Now, on to cluster inspection.

CLUSTER 1

We’ll order by cluster 1’s top ten bike models in descending order. We can quickly see that the top 10 models purchased are predominantly high-end and mountain. All but one model has a carbon frame.

CLUSTER 2

Next, we’ll inspect cluster 2. We can see that the top models are all low-end/affordable models. There’s a mix of road and mountain for the primary category and a mix of frame material as well.



CLUSTERS 3, 4 & 5

Inspecting clusters 3, 4 and 5 produces interesting results. For brevity, we won’t display the tables. Here are the results:

Cluster 3: Tends to prefer road bikes that are low-end.

Cluster 4: Is very similar to Cluster 2, with the majority of bikes in the low-end price range.

Cluster 5: Tends to prefer road bikes that are high-end.

Reviewing Results

Once the clustering is finished, it’s a good idea to take a step back and review what the algorithm is saying. For our analysis, we got clear trends for four of five groups, but two groups (clusters 2 and 4) are very similar. Because of this, it may make sense to combine these two groups or to switch from the k = 5 to the k = 4 results.



Conclusion & Future Scope

While this guide provides a step-by-step process for identifying, prioritizing, and targeting your best current customer segments, simply following it does not guarantee success. To be effective, you must prepare and plan for the various challenges and hurdles that each step may present, and always make sure to adapt your process to any new information or feedback that might change its output.

Additionally, you cannot force-feed this process to your business. If the key stakeholders who will be affected by the best current customer segmentation process do not fully buy in, then the outputs produced from it will be relatively meaningless.

If you properly manage the best current customer segmentation process, however, the impact it can have on every part of your organization (sales, marketing, product development, customer service, etc.) is immense. Your business will possess stronger customer focus and market clarity, allowing it to scale in a far more predictable and efficient manner.

Ultimately, that means no longer needing to take on every customer that is willing to pay for your product or service, which will allow you to instead hone in on a specific subset of customers that present the most profitable opportunities and efficient use of resources. That is critical for every business, of course, but at the expansion stage, it can often be the difference between incredible success and certain failure.



REFERENCES

Dr. R. Gardener, “The Essential R Reference” (2014)

Concepts of customer segmentation: http://www.business-science.io/ and https://labs.openviewpartners.com/customer-segmentation/

Source data related to our analysis has been collected from https://github.com/mdancho84/orderSimulatoR/tree/master/data

https://www.kaggle.com/

https://www.r-project.org/

https://www.rstudio.com/
