Cluster Analysis - Keyword Clustering

KEYWORD CLUSTERING

Understanding search behavior using R and Tableau

Introduction

■ Why is keyword clustering important?

– To understand what your visitors are trying to accomplish

– To identify the profitable keywords for the website

– To group the keywords into logical groups, so that work on one keyword positively impacts the results of the others in its group

■ Challenges?

– Google has made it difficult to analyze search keywords in recent years, because it passes “(not provided)” instead of the actual keywords

Concept: K-Means Clustering/Unsupervised Learning

■ Unsupervised: trying to understand the structure of our underlying data, rather than trying to optimize for a specific, pre-labeled criterion

– No assumptions on data (contrast with pre-defined relationships such as visitors from mobile or visitors from referral)

■ k-means clustering: method of partitioning data into ‘k’ subsets, where each data element is assigned to the closest cluster based on the distance of the data element from the center of the cluster.
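The assign-to-closest-center idea can be sketched in a few lines. This is a minimal illustration in plain Python, not the R workflow the deck uses; the function names, the toy points, and the random choice of starting centers are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means: returns (cluster label per point, cluster centers)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its closest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers


# Toy data (hypothetical): two visibly separate groups of points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
labels, centers = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.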

Converting Text to Numeric Data

■ In order to use k-means clustering with text data, the text must first be transformed into numeric data

■ R has packages to support this: RSiteCatalyst retrieves the search keyword data from Adobe Analytics (SiteCatalyst), and RTextTools converts it into a document-term matrix (DTM)

■ In the document-term matrix (DTM), each row is a search term and each column is a 1/0 representation of whether a single word is contained within the natural search term.
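To make the row/column layout concrete, here is a small sketch of a binary document-term matrix built in plain Python. It is an illustration only, not the RTextTools implementation; the function name and the sample search terms are hypothetical.

```python
def document_term_matrix(search_terms):
    """Return (vocabulary, matrix): one row per search term, one 1/0 column per word."""
    # Vocabulary: every distinct word across all search terms, sorted for stable columns.
    vocabulary = sorted({word for term in search_terms for word in term.lower().split()})
    matrix = [
        [1 if word in term.lower().split() else 0 for word in vocabulary]
        for term in search_terms
    ]
    return vocabulary, matrix


# Hypothetical search terms, for illustration only.
terms = ["cheap flights", "cheap hotels", "flights to paris"]
vocabulary, dtm = document_term_matrix(terms)

print(vocabulary)
for term, row in zip(terms, dtm):
    print(row, term)
```

Each row is now a numeric vector, which is the form k-means needs in order to compute distances between search terms.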

Keyword Augmentation

■ stemWords reduces a word down to its root, a standardization method that avoids having multiple versions of a word referring to the same concept (e.g. argue, arguing, argued all reduce to ‘argu’)

■ removeStopwords eliminates common English words such as “they”, “he”, “always”

■ minWordLength sets the minimum number of characters that constitutes a ‘word’; here it is set to 1

■ removePunctuation removes periods, commas, etc.
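The four options above can be imitated in plain Python to show what each step does to a search term. This is a rough sketch, not the R implementation: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter's.

```python
import string

# Assumption: a tiny illustrative stop-word list, not a full English list.
STOPWORDS = {"they", "he", "always", "the", "a", "to"}


def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(term, min_word_length=1):
    """Apply punctuation removal, stop-word removal, length filter, and stemming."""
    # removePunctuation: drop periods, commas, etc.
    no_punct = term.lower().translate(str.maketrans("", "", string.punctuation))
    # removeStopwords: drop common English words.
    words = [w for w in no_punct.split() if w not in STOPWORDS]
    # minWordLength: keep only words with at least this many characters.
    words = [w for w in words if len(w) >= min_word_length]
    # stemWords: reduce each remaining word to a root form.
    return [crude_stem(w) for w in words]


print(normalize("They always argued, he said."))
print(normalize("argue arguing argued"))
```

With these toy rules, "argue", "arguing", and "argued" all map to the same root, so they would share a single column in the document-term matrix.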

Inspecting Common Elements

Guessing at ‘k’: A First Run at Clustering

■ One downside to using k-means clustering as a technique is that the user must choose ‘k’, the number of clusters expected from the dataset

– ‘k’ can be chosen manually, by guessing (but this requires re-clustering until all keywords are clustered)

– ‘k’ can be chosen using the elbow method

Elbow method: Finding breakpoints in our cost plot

After the slope becomes flat, each additional cluster becomes less effective at reducing the distance from each data point to its cluster center.

So while a single ‘best’ value of ‘k’ is not determined, the range of values of ‘k’ to evaluate has been determined
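The cost curve behind the elbow method can be sketched as follows: run k-means for a range of ‘k’ values and record the total squared distance from each point to its cluster center. This is a plain-Python illustration under stated assumptions, not the R code behind the deck's plot; the function names, the toy points, and the deterministic "farthest point" seeding (chosen so the example is repeatable) are all hypothetical choices.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Assumption: seed centers deterministically, each as far as possible from the rest."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


def kmeans_cost(points, k, iterations=100):
    """Run k-means and return the total within-cluster sum of squared distances."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))


# Toy data (hypothetical): three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (0, 10), (0, 11), (1, 10)]

costs = {k: kmeans_cost(points, k) for k in range(1, 5)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

On this toy data the cost falls steeply up to k = 3 and barely changes afterwards, which is the flattening described above. On real keyword data the bend is usually less sharp, so it suggests a range of ‘k’ values to try rather than one answer.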

Output from clustering activity

Naming the clusters and tagging as per the theme

Tableau Report Snapshot
