
Koffi Pierre Yao MIT 12/30/2015

An algorithm for model-free, parameter-less clustering of stochastically distributed data

This report documents the performance of a density-based clustering algorithm designed to bypass the need for the data scientist to tune clustering parameters. Most clustering algorithms available in Python's sklearn require the tuning of parameters such as the number of clusters, the neighborhood size, or the branching factor, which in turn require domain knowledge of the data or excellent data science skills. Although data science skills are indispensable for good data analysis, a clustering algorithm requiring minimal prior knowledge of the data structure is undeniably useful.

Figure 1: A comparison of the clustering algorithms in scikit-learn. Information available at http://scikit-learn.org/stable/modules/clustering.html

Herein, I report the design of a model-free, parameter-free clustering algorithm.

Principle:

A point density surface is calculated as:

$$\text{density}[i] = \sum_{k=1}^{n} e^{-\,\text{distance}(\text{point}[i],\ \text{point}[k])}$$

where $n$ is the number of points.

Farther points have exponentially diminished influence on the density score of a given point. This approach is expected to concentrate density at the location of the cluster center, where points are expected to be densest if the process generating the data points is reasonably stochastic around the true cluster center. Note that the distance measure does not need to be Euclidean.
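As a minimal illustration (not the module's implementation, which appears in the appendix), the density surface can be computed directly with NumPy; points is assumed to be a row-arranged data matrix:

    import numpy as np

    def density_surface(points):
        # Pairwise Euclidean distances between all row-arranged points
        diffs = points[:, None, :] - points[None, :, :]
        distances = np.sqrt((diffs ** 2).sum(axis=-1))
        # Density score of each point: sum of exponentially decayed
        # distances to all points (the self term adds e^0 = 1 to each score)
        return np.exp(-distances).sum(axis=1)

Note that this direct form stores the full n-by-n distance matrix, which motivates the random sampling described below.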


Cluster centers are identified as the peaks (local maxima) of the density surface using a numerical gradient ascent algorithm with random starting points.
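A sketch of the ascent step, assuming a precomputed density array and pairwise distance matrix (the function name and signature are illustrative, not the module's API):

    import numpy as np

    def climb_to_peak(start, density, pair_distance, n_neighbors=20):
        # Repeatedly jump to the densest of the n nearest neighbors until
        # no neighbor improves on the current point (a local density maximum)
        current = start
        while True:
            neighbors = np.argsort(pair_distance[current])[1:n_neighbors + 1]
            best = neighbors[np.argmax(density[neighbors])]
            if density[best] <= density[current]:
                return current
            current = best

Starting this climb from many random points and tallying where it terminates concentrates the votes on the peaks of the density surface.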

Once the centers are found, cluster assignment is done by proximity (in terms of the distance measure) to a given center.
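Assignment itself is straightforward once the centers are known; a minimal Euclidean sketch:

    import numpy as np

    def assign_by_proximity(points, centers):
        # Label each point with the index of its nearest center
        # (Euclidean here; any distance measure can be substituted)
        diffs = points[:, None, :] - centers[None, :, :]
        return np.argmin((diffs ** 2).sum(axis=-1), axis=1)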

Random sampling is used for large data sets to speed up the search for centers.

Results compared to DBSCAN (the least-parameterized clustering algorithm in sklearn):


Figure 2: Examples of clustering using the parameter-less density-based algorithm presented herein (left column) compared to DBSCAN with fixed eps = 1 (right column). The ground-truth groupings of the points are presented in the middle column. Cluster centers are shown with a star marker in my algorithm. Data are blobs generated using sklearn 'make_blobs'.

In Figure 2, it is clear that for random blobs of data, the algorithm presented herein outperforms DBSCAN without the need for parameter tuning.

Figure 3: Visualization of the cluster-center search using the convexity of the density surface. Cluster centers found are marked with red stars. The density equation above tends to generate a surface which is concave-down around the cluster centers, where the concentration of points tends to be greatest.

Caution:

Figure 4: Clustering strongly anisotropic data results in degraded performance. However, without parameter tuning, the presented algorithm still achieves better clustering fidelity than DBSCAN. In these degenerate cases, the user might need to invoke data transformation tools or supply a custom distance measure function (using dist_type = 'custom' and dist_func = <<custom_function_handle>>).
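For instance, a hypothetical custom distance that roughly whitens anisotropic data by its global covariance could be supplied as follows (a sketch, not part of the report's experiments; the skewing transform mirrors the one commented out in example_class.py below):

    import numpy as np
    from sklearn.datasets import make_blobs
    import dcsm2 as DCSM

    # Anisotropic test data
    data, _ = make_blobs(n_samples=1000, centers=3, random_state=170)
    data = np.dot(data, [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]])

    # Hypothetical custom distance: squared Mahalanobis distance, which
    # rescales the skewed directions before measuring separation
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

    def whitened_dist(x1, x2):
        diff = x1 - x2
        return diff.dot(cov_inv).dot(diff)

    dcluster = DCSM.DCSM(data=data, dist_type='custom',
                         dist_func=whitened_dist, attributions='yes')
    center_points, attributions = dcluster.find_medoids()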

Using the code:

import dcsm2 as DCSM

# Initialization
dcluster = DCSM.DCSM(data = <row_data_matrix>,
                     dist_type = <number or 'norm2' or 'custom'>,
                     redundant_center_discard = <True or False>,
                     attributions = <'yes' or 'no'>,
                     sampling = <'y' or 'n'>,
                     dist_func = <custom_function_handle>,
                     centers_proximity_ratio = <number_less_than_1>)

# Fitting
center_points, attributions = dcluster.find_medoids()

# Content of dcluster after fitting:

dcluster.data: all data input during the clustering training.

dcluster.sample: indices (within dcluster.data) of the sample data used by the algorithm to find the cluster centers.

dcluster.center_points: coordinates of the cluster centers after training.

dcluster.attributions: cluster assignments of all points in dcluster.data, arranged as <point index in dcluster.data, cluster center index in dcluster.center_points>. It is None if attributions is set to 'no' at initialization.

dcluster.dist_type: the distance function type used; either the value of p in a p-norm distance measure, or 'custom' if a custom distance function is used.

dcluster.kwargs: a dictionary record of the argument/value pairs used at initialization.

dcluster.assign(row_arranged_data): outputs the cluster assignments of new points. Output is arranged as a numpy array [data point coordinates, cluster number].
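Putting the pieces together, a minimal end-to-end run might look like the following (the blob data is illustrative):

    from sklearn.datasets import make_blobs
    import dcsm2 as DCSM

    data, y = make_blobs(n_samples=2000, centers=4, random_state=0)

    # Fit: find the cluster centers and assign every training point
    dcluster = DCSM.DCSM(data=data, dist_type=2,
                         redundant_center_discard=True,
                         attributions='yes', sampling='y')
    center_points, attributions = dcluster.find_medoids()

    # Assign previously unseen points to the learned centers
    new_assignments = dcluster.assign([[0.0, 0.0], [5.0, -3.0]])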

Conclusion:

I report a density-based clustering algorithm for parameter-less clustering of stochastically distributed, clustered data. Without parameter tuning, this algorithm shows much improved clustering performance compared to DBSCAN, the least-parameterized clustering module available in sklearn.

The code and example are presented below.

dcsm2.py:

# UNSUPERVISED CLUSTER CENTERS IDENTIFICATION USING NON-ITERATIVE
# DENSITY ANALYSIS AND NON-ANALYTICAL GRADIENT DESCENT
# WRITTEN BY KOFFI PIERRE CLAVER YAO
# DATE: 12/28/2015
# CURRENT ADDRESS: MASSACHUSETTS INSTITUTE OF TECHNOLOGY

import copy
from math import exp

import numpy as np


class DCSM:
    '''This module attempts to cluster data based on density of points.
    The 'data' is assumed to be arranged with points as rows of the data
    table, as a numpy array:

        data = [ ...x1...
                 ...x2...
                 ...x3...
                   ...
                 ...xn... ]

    Works best with Gaussian-distributed data. This algorithm computes the
    density matrix and can therefore be O(n^2) memory intensive; <<<<Random
    Sampling>>>> is used to optimize the algorithm performance.

    **kwargs:
    redundant_center_discard = True/False: if the algorithm converges on
        several centers, should close centers be removed?
    time_show = True/False: show the time needed to do the fitting or not.
    attributions = 'yes'/'no': center assignments. Any string not 'no' is
        considered 'yes'.
    centers_proximity_ratio = <float>: how close the centers may be, as a
        ratio of the maximum separation.
    data = the data that will be fitted.
    sampling = 'y'/'n': should the algorithm use sampling to search for
        the centers? Default is 'y' when the number of data points
        exceeds 500.
    dist_func = custom distance function handle. Only called if
        dist_type = 'custom'.

    EXAMPLE USAGE:
    import dcsm2 as DCSM
    dcluster = DCSM.DCSM(data = data, dist_type = 4,
                         redundant_center_discard = True,
                         attributions = 'yes', sampling = 'n')
    center_points, attributions = dcluster.find_medoids()
    dcluster.assign([[5, -10], [6, 7]])

    # attributions: assignments to centers formatted in columns:
    #     [point_index_in_data, index_of_center_in_center_points]
    # center_indices: indices of the cluster centers in the data set
    # center_points: actual coordinates of the centers of the data set
    '''

    def __init__(self, **kwargs):
        self.center_removed = []
        self.attributions = None
        self.center_indices = None
        self.data = kwargs.get('data', [])
        self.dist_type = kwargs.get('dist_type', 2)
        self.kwargs = kwargs

    def __sample__(self, num_pts, size):
        if size >= num_pts:
            self.sample = range(num_pts)
        else:
            # Always keep the centers found so far in the sample
            self.sample = np.concatenate((
                np.array(self.center_indices, dtype=int),
                np.random.choice(range(num_pts), size=size, replace=False)))

    # DENSITY CLUSTERING
    def find_medoids(self):
        self.num_pts, self.num_feature = self.data.shape
        # Number of iterations used to find the centers. I find that more
        # than one iteration degrades the result instead of improving it.
        it = self.kwargs.get('num_iter', 1)
        answer = 'n'
        self.center_indices = []
        center_score = []
        size = self.num_pts
        if self.num_pts > 500:
            # Stochastic sampling search of centers using random subgroups
            answer = self.kwargs.get('sampling', 'y')
            if answer.lower() == 'y':
                # Number of points in the sample used for finding the
                # cluster centers. The sample should be sufficiently large;
                # the formula below is heuristic (500 points for 2D data
                # containing more than 500 data points).
                size = (self.num_feature - 1) * int(min(self.num_pts / 5, 500))
        for i in xrange(it):
            self.__sample__(self.num_pts, size)
            dens, pair_distance = self.sample_data_density()
            temp_score = self.stoch_search_local_maxima(dens, pair_distance, 20)
            temp = np.where(temp_score > 0)[0]
            # Remapping to the original indices in data
            for j in temp:
                org_loc = self.sample[j]
                if org_loc not in self.center_indices:
                    self.center_indices.append(org_loc)
                    center_score.append(temp_score[j])
                else:
                    center_score[self.center_indices.index(org_loc)] += temp_score[j]
        self.center_indices = np.array(self.center_indices)
        center_score = np.array(center_score)
        # Discarding redundant centers
        if self.kwargs.has_key('redundant_center_discard'):
            if self.kwargs['redundant_center_discard'] == True:
                ratio = self.kwargs.get('centers_proximity_ratio', 0.1)
                self.discard_redundant_centers(center_score, ratio)
        self.center_points = self.data[self.center_indices, :]
        # Attributing
        if self.kwargs.has_key('attributions'):
            if self.kwargs['attributions'] != 'no':
                self.attribute()
        return copy.deepcopy(self.center_points), copy.deepcopy(self.attributions)

    def attribute(self):
        self.cluster_attr_by_dist_to_center()

    def distance(self, X1, X2):
        # Calculates the distance between X1 and X2 per the specified order
        # of the norm. It does not take the 1/p root. dist_func = <handle>
        # is only used when dist_type = 'custom'.
        if self.dist_type == 'custom':
            funct = self.kwargs['dist_func']
            dist = funct(X1, X2)
        elif self.dist_type == 'norm2':
            dist = np.linalg.norm(X1 - X2, ord=None)
            dist = dist ** 2
        else:
            dist = np.linalg.norm(X1 - X2, ord=self.dist_type)
            dist = dist ** self.dist_type
        return dist

    def sample_data_density(self):
        # Calculates the density of the sample data based on the specified
        # distance type; returns the density at each point and the pairwise
        # distances between data points.
        cdata = self.data[self.sample, :]
        num_pts = cdata.shape[0]
        data_density = [0] * num_pts
        pair_distance = np.zeros((num_pts, num_pts))
        for i in xrange(num_pts - 1):
            for j in xrange(i + 1, num_pts):
                d = self.distance(cdata[i], cdata[j])
                pair_distance[i, j] = d
                pair_distance[j, i] = d
                d = exp(-d)
                data_density[i] = data_density[i] + d
                data_density[j] = data_density[j] + d
        data_density = np.array(data_density)
        return data_density, pair_distance

    def n_nearest_neighbor(self, point_index, n, dist_from_point):
        # n = how many nearest neighbors to seek
        sort_indices = sorted(range(len(dist_from_point)),
                              key=lambda k: dist_from_point[k])
        # The first index is the point itself because a point is closest
        # to itself
        return sort_indices[1:n + 1]

    def numerical_grad_descent(self, start, dens, pair_distance, n):
        # Returns the local maximum of the density surface by implementing
        # a point-to-point numerical gradient ascent
        count_no_jumps = 0
        jumpedto = self.jump_up_to(start, dens, pair_distance, n)
        while count_no_jumps < 2:
            start = jumpedto
            jumpedto = self.jump_up_to(start, dens, pair_distance, n)
            if jumpedto == start:
                count_no_jumps += 1
            else:
                count_no_jumps = 0  # Reset the count because we jumped
        return jumpedto

    def jump_up_to(self, start, dens, pair_distance, n):
        # Use the n nearest neighbors to decide where to ascend.
        dist_from_point = pair_distance[start, :]
        neighbors_indices = self.n_nearest_neighbor(start, n, dist_from_point)
        grad = float('-inf')
        neighb_jumpto = start
        for neighb in neighbors_indices:
            temp = dens[neighb] - dens[start]
            # If a neighbor has a locally higher density than the current
            # point, jump to it; we select the neighbor with the highest
            # density relative to the current point.
            if temp < 0:
                continue
            elif temp > grad:
                neighb_jumpto = neighb
                grad = temp
        return neighb_jumpto

    def stoch_search_local_maxima(self, dens, pair_distance, n):
        # Returns a numpy array counting how many times the search
        # terminated on a particular point, indexed by that point
        num_pts = len(dens)
        winners = [0] * num_pts
        for i in xrange(num_pts // 10):
            # Stochastic search using 10% as many starting points as there
            # are data points. This choice could be computationally
            # intensive.
            start = np.random.choice(np.array(range(num_pts)))
            loc_max = self.numerical_grad_descent(start, dens, pair_distance, n)
            winners[loc_max] += 1
        return np.array(winners)

    def cluster_attr_by_dist_to_center(self):
        # Assigns a cluster to each point in data based on proximity to
        # the cluster centers stored in center_indices
        num_points = self.data.shape[0]
        points_indices = range(num_points)
        centers = [0] * num_points
        for i in xrange(num_points):
            dist = float('inf')
            k = 0
            for c in self.center_indices:
                pd = self.distance(self.data[i], self.data[c])
                if pd < dist:
                    dist = pd
                    centers[i] = k  # closest center wins
                k = k + 1
        self.attributions = np.array([points_indices, centers]).T

    def discard_redundant_centers(self, center_score, ratio):
        # Discards a center if it is too close to another (as a ratio of
        # the maximum distance between centers) and has a lower convergence
        # score than the other
        ind_to_rm = list()
        l = len(self.center_indices)
        couples = [[i, j] for i in xrange(l) for j in xrange(i + 1, l)]
        inter_center_dist = [self.distance(self.data[self.center_indices[i]],
                                           self.data[self.center_indices[j]])
                             for i in xrange(l) for j in xrange(i + 1, l)]
        if len(inter_center_dist) < 2:
            print 'ONLY ONE CLUSTER CENTER FOUND, CAN\'T DISCARD ANY \n'
        else:
            m = max(inter_center_dist)
            temp = [couples[t] for t in range(len(inter_center_dist))
                    if inter_center_dist[t] < m * ratio]
            for el in temp:
                # Keep the member of the pair with the higher score
                if center_score[el[0]] <= center_score[el[1]]:
                    if el[0] not in ind_to_rm:
                        ind_to_rm.append(el[0])
                else:
                    if el[1] not in ind_to_rm:
                        ind_to_rm.append(el[1])
            ind_to_rm = list(set(ind_to_rm))
            self.center_removed = ind_to_rm
            self.center_indices = np.delete(self.center_indices, ind_to_rm)

    def assign(self, data):
        # Assigns a cluster to each point in data based on proximity to
        # the cluster centers stored in center_indices
        try:
            s = data.shape
        except AttributeError:
            data = np.array(data)
            s = data.shape
        try:
            ignore = s[1]
            num_points = s[0]
            del ignore
        except IndexError:
            num_points = 1
        centers = [0] * num_points
        for i in xrange(num_points):
            dist = float('inf')
            k = 0
            for c in self.center_indices:
                pd = self.distance(data[i], self.data[c])
                if pd < dist:
                    dist = pd
                    centers[i] = k  # closest center wins
                k = k + 1
        centers = np.array(centers)
        try:
            assigned = np.concatenate((data, centers))
        except ValueError:
            centers = np.array([centers])
            assigned = np.concatenate((data, centers.T), axis=1)
        return assigned

example_class.py:

import copy
import time

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles

import dcsm2 as DCSM

n_samples = 5000
random_state = 170
centers = np.random.randint(1, 20)
#centers = 5
print 'NUMBER OF CENTERS = ', centers
data, y = make_blobs(n_samples=n_samples, random_state=random_state,
                     centers=centers)
#data, y = make_circles(n_samples=n_samples, random_state=random_state)
# Anisotropically distributed data
#transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
#data = np.dot(data, transformation)

plt.close('all')
dist_type = 2

# DENSITY CLUSTERING
ti = time.time()
# Initialization
dcluster = DCSM.DCSM(data=data, dist_type=2, redundant_center_discard=False,
                     attributions='yes', sampling='y')
center_points, attributions = dcluster.find_medoids()  # centers_proximity_ratio = 0.1
tdcluster = time.time() - ti
print 'Duration My algorithm = ', tdcluster

# DBSCAN for comparison
ti = time.time()
cores, labels = cluster.dbscan(data, eps=1)
tdbscan = time.time() - ti
print 'Duration DBSCAN = ', tdbscan

# Heat scatter plot of data density
colors = attributions[:, 1]
v = range(len(center_points))
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, sharex=True,
                                  figsize=(21, 5))
ax1.scatter(data[:, 0], data[:, 1], c=colors, s=50)
peak_pts = center_points
ax1.scatter(peak_pts[:, 0], peak_pts[:, 1], marker='*', c=v, s=1000)
titre = 'My algorithm (time elapsed = ' + str(round(tdcluster, 3)) + ')'
ax1.set_title(titre)
ax2.scatter(data[:, 0], data[:, 1], c=y, s=50)
ax2.set_title("Actual clusters sklearn blobs")
titre = 'DBSCAN with eps = 1 (time elapsed = ' + str(round(tdbscan, 3)) + ')'
ax3.scatter(data[:, 0], data[:, 1], c=labels, s=50)
ax3.set_title(titre)
f.show()

# Recompute the density over the full data set for visualization
dcluster.sample = range(dcluster.data.shape[0])
dens, pair_distance = dcluster.sample_data_density()
peaks = copy.deepcopy(dcluster.center_indices)

# 3D plot of data density
fig = plt.figure(figsize=(10, 8))
ax = fig.gca(projection='3d', axisbg='gray')
X = data[:, 0]
Y = data[:, 1]
Xp = peak_pts[:, 0]
Yp = peak_pts[:, 1]
dp = dens[peaks]
ax.scatter(X, Y, dens, zdir='z', marker='o', s=50,
           color=(0.33725490196078434, 0.7058823529411765, 0.9137254901960784),
           depthshade=True)
ax.scatter(Xp, Yp, dp, zdir='z', marker='*', s=300, c='r', depthshade=False)
plt.title("VISUALIZATION OF FINDING CLUSTER CENTERS \n USING DATA DENSITY")
plt.show()