Matlab - Clustering Data Outputs Irregular Plot Graph - Stack Overflow



8/19/2019 Matlab - Clustering Data Outputs Irregular Plot Graph - Stack Overflow 1/4

clustering data outputs irregular plot graph

OK, I will run down what I'm trying to achieve and how I tried to achieve it, then I will explain why I tried this method.

I have data from the KDD Cup 1999 in its original format. The data has 494k rows with 42 columns.

My goal is to cluster this data unsupervised. From a previous question here:

clustering and matlab

I received this feedback:

For starters, you need to normalize the attributes to be of the same scale: when computing the euclidean distance as part of step 3 in your method, the features with values such as 239 and 486 will dominate over the other features with small values as 0.05, thus disrupting the


Another point to remember is that too many attributes can be a bad thing (curse of dimensionality). Thus you should look into feature selection or dimensionality reduction techniques.
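To illustrate the scaling point in the feedback, here is a minimal sketch on made-up toy data (the values in X are hypothetical, chosen only to mimic the 239/486 versus 0.05 scales mentioned above). It uses z-scoring per column, which is one common choice and an assumption here; the feedback does not say which scaling was meant.

```matlab
% Toy data (hypothetical): column 1 is in the hundreds, column 2 around 0.01-0.1
X = [239 0.05; 486 0.01; 240 0.09; 480 0.02];

% Per-feature contribution to the squared Euclidean distance between rows 1 and 3
dRaw = (X(1,:) - X(3,:)).^2;

% Z-score each column: subtract the column mean, divide by the column std
mu = mean(X, 1);
sigma = std(X, 0, 1);
sigma(sigma == 0) = 1;   % guard against constant columns (avoids division by zero)
Z = (X - repmat(mu, size(X,1), 1)) ./ repmat(sigma, size(X,1), 1);

% Same contribution after scaling
dScaled = (Z(1,:) - Z(3,:)).^2;

disp(dRaw ./ sum(dRaw))        % share of the distance per feature, raw data
disp(dScaled ./ sum(dScaled))  % share of the distance per feature, scaled data
```

On the raw data the first column accounts for nearly all of the distance between the two rows; after scaling, the second column (where the rows actually differ most in relative terms) dominates instead.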

So the first thing I went about doing was addressing the feature selection which is related to this article:

and looks like this after selecting the necessary features:

So for the clustering I removed the discrete values, which left me with 3 columns of numeric data. I then removed the duplicate rows in the file, which reduced the data from 494k to 67k rows. This was done like so (see: junk, index and unique on a matrix (how to keep matrix format)):

[M,ind] = unique(data, 'rows', 'first');

[~,ind] = sort(ind);

M = M(ind,:);
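As a small check of what the three lines above do, the sketch below runs them on a hypothetical 5-row matrix and compares the result with the 'stable' option of unique. That option is an alternative not used in the post, and it is only available in newer MATLAB releases, so treat it as an assumption about your environment.

```matlab
% Hypothetical toy matrix with duplicate rows
data = [3 1; 1 2; 3 1; 2 2; 1 2];

% Approach from the post: unique returns sorted rows plus the index of the
% first occurrence of each row; sorting those indices restores input order
[M, ind] = unique(data, 'rows', 'first');
[~, order] = sort(ind);
M = M(order, :);

% Alternative (assumes a release that supports 'stable')
Mstable = unique(data, 'rows', 'stable');

disp(M)
disp(isequal(M, Mstable))
```

Both variants keep the first occurrence of each row in its original position, so the later random sampling draws from the same de-duplicated set either way.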

I then used the random permutation to reduce the file size from 67k to 1000 like so:

m = 1000;

n = 3;

%# pick random rows

indX = randperm( size(M,1) );

indX = indX(1:m);

%# pick random columns

indY = randperm( size(M,2) );

indY = indY(1:n);

%# filter data

data = M(indX,indY)

So now I have a file with the 3 features I selected, with duplicate records removed and the dataset further reduced by random permutation. My last goal was to normalize this data, which I did with:

normalized_data = data/norm(data);

I then used the following K-means script:

%% generate clusters

K = 4;

%% cluster

opts = statset('MaxIter', 500, 'Display', 'iter');

[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...

'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters


figure, hold on

scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')

scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')

hold off, xlabel('x'), ylabel('y'), zlabel('z')

%% plot clusters quality


[silh,h] = silhouette(data, clustIDX);

avrgScore = mean(silh);

%% Assign data to clusters

% calculate distance (squared) of all instances to each cluster centroid

numObservarations = size(data,1); % number of rows being clustered

D = zeros(numObservarations, K); % init distances

for k=1:K

    %d = sum((x-y).^2).^0.5

    D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);

end

% find for all instances the cluster closest to it

[minDists, clusterIndices] = min(D, [], 2);

% compare it with what you expect it to be

sum(clusterIndices == clustIDX)
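The assignment step above can be tried without running kmeans. The sketch below is a self-contained version on hypothetical data and hypothetical centroids (data, clusters and numObs are made-up stand-ins, not values from the KDD set); it only shows how the distance matrix and the nearest-centroid index are built.

```matlab
% Hypothetical 2-D points and two hypothetical centroids
data = [0 0; 0 1; 10 10; 10 11; 0.5 0.5];
clusters = [0 0.5; 10 10.5];

K = size(clusters, 1);
numObs = size(data, 1);

% Squared Euclidean distance from every point to every centroid
D = zeros(numObs, K);
for k = 1:K
    diffs = data - repmat(clusters(k,:), numObs, 1);
    D(:,k) = sum(diffs.^2, 2);
end

% For each point, the smallest distance and the index of the closest centroid
[minDists, clusterIndices] = min(D, [], 2);
disp(clusterIndices')
```

When the centroids come from a converged kmeans run with 'sqEuclidean' distance, comparing clusterIndices with the returned labels, as in the post, is a reasonable sanity check: the count of matches would be expected to equal the number of rows.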

But my results are still coming out like those in my original question, which I asked here: clustering and matlab

Here is what the data looks like when plotted:


Can anyone help solve this problem? Are the methods I'm using not the correct methods, or is there something I'm missing?

matlab sorting random cluster-analysis normalization


edited Oct 16 '11 at 16:29 asked Oct 16 '11 at 11:23

Garrith Graham


2 Answers

Just like to say thanks to cyborg and Amro for helping. I realized that rather than create my own pre-processing I kept the dimensions as such, and I finally managed to get some clustered data!

Output!

Of course I still have some outliers, but if I could get rid of them and plot the graph from -0.2 to 0.2 I'm sure it would look a lot better. But if you look at the original attempt I seem to be getting there!

%% load data

%# read the list of features

fid = fopen('kddcup.names','rt');

C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);

fclose(fid);

%# determine type of features

C{2} = regexprep(C{2}, '.$',''); %# remove "." at the end

attribNom = [ismember(C{2},'symbolic');true]; %# nominal features

%# build format string used to read/parse the actual data

frmt = cell(1,numel(C{1}));

frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number

frmt( ismember(C{2},'symbolic') ) = {'%s'}; %# nominal features: read as string

frmt = [frmt{:}];

frmt = [frmt '%s']; %# add the class attribute

%# read dataset

fid = fopen('kddcup.data_10_percent_corrected','rt');

C = textscan(fid, frmt, 'Delimiter',',');

fclose(fid);

%# convert nominal attributes to numeric

ind = find(attribNom);

G = cell(numel(ind),1);

for i=1:numel(ind)

    [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );

end

%# all numeric dataset

fulldata = cell2mat(C);

%% dimensionality reduction

columns = 42

[U,S,V] = svds(fulldata, columns); %# U is used below when filtering the data


%% randomly select dataset

rows = 5000;

%# pick random rows

indX = randperm( size(fulldata,1) );

indX = indX(1:rows);


%# pick random columns

indY = randperm( size(fulldata,2) );

indY = indY(1:columns);

%# filter data

data = U(indX,indY)

%% apply normalization method to every cell

data = data./repmat(sqrt(sum(data.^2)),size(data,1),1)

%% generate sample data

K = 4;

numObservarations = 5000;

dimensions = 42;

%% cluster

opts = statset('MaxIter', 500, 'Display', 'iter');

[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...

'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters

figure, hold on

scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')

scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')

hold off, xlabel('x'), ylabel('y'), zlabel('z')

%% plot clusters quality


[silh,h] = silhouette(data, clustIDX);

avrgScore = mean(silh);

%% Assign data to clusters

% calculate distance (squared) of all instances to each cluster centroid

D = zeros(numObservarations, K); % init distances

for k=1:K

    %d = sum((x-y).^2).^0.5

    D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);

end

% find for all instances the cluster closest to it

[minDists, clusterIndices] = min(D, [], 2);

% compare it with what you expect it to be

sum(clusterIndices == clustIDX)

answered Oct 18 '11 at 3:07 community wiki

Garrith Graham


...or, instead of getting rid of the outliers you could examine them in more detail. When you are working on anomaly detection, often the outliers are more interesting than the "well-behaved" points. Something interesting might be happening there. Bob Durrant Oct 18 '11 at 12:03

Ahh, good point Bob. In fact the outliers may be the r2l and u2r attacks; I will investigate this further! +1 for spotting that! Garrith Graham Oct 18 '11 at 18:01

You have a problem in the normalization: data/norm(data). What you probably need to do is to use:

data_normed = data./repmat(sqrt(sum(data.^2)),size(data,1),1)

This computes the norm of every column of data, then it duplicates the answer to the original size of data, then it divides data by the norms of the columns.
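A minimal sketch of the difference, on a hypothetical 3-by-2 matrix: for a matrix, norm(data) returns a single scalar (the 2-norm), so dividing by it rescales every column by the same factor and leaves their relative scales untouched, while the column-wise form gives each column unit length.

```matlab
% Hypothetical data: column 2 is much larger than column 1
data = [3 100; 4 200; 0 200];

% Whole-matrix scaling: one scalar for every entry
scaledWhole = data / norm(data);

% Column-wise scaling: one norm per column, replicated to the size of data
colNorms = sqrt(sum(data.^2, 1));
data_normed = data ./ repmat(colNorms, size(data,1), 1);

disp(sqrt(sum(scaledWhole.^2, 1)))   % column norms still differ by the same ratio
disp(sqrt(sum(data_normed.^2, 1)))   % every column now has norm 1
```

A column of all zeros would have norm 0 and produce NaN in the column-wise version, so such columns would need to be dropped or guarded against first.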


A better way to reduce the dimensionality of the number of features is [U,S,V]=svd(data); U=U(:,1:m) or for sparse data [U,S,V]=svds(data,m). It might lose some information in the way, but it's much better than random picking.
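A small sketch of the SVD-based reduction on a hypothetical 4-by-3 matrix of rank 2. Two things here are assumptions beyond the answer: the economy-size call svd(data,'econ'), and scaling the kept columns of U by the singular values to obtain coordinates (the answer keeps U(:,1:m) only). No centering is applied, so this is not the same as PCA.

```matlab
% Hypothetical data: 4 observations, 3 features, column 3 = 0.5 * column 1
data = [2 0 1; 0 1 0; 4 0 2; 0 3 0];
m = 2;   % number of dimensions to keep

[U, S, V] = svd(data, 'econ');

% Reduced representation: one row per observation, m columns
reduced = U(:, 1:m) * S(1:m, 1:m);

% Map back to the original feature space to see how much was lost
approx = reduced * V(:, 1:m)';

disp(size(data))
disp(size(reduced))
disp(norm(data - approx))   % close to zero here because the data has rank 2
```

The original matrix keeps its size; it is the reduced matrix (the first m columns of U, optionally scaled) that has fewer columns and would be passed on to the clustering step.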

edited Oct 16 '11 at 22:23 answered Oct 16 '11 at 18:54



[U,S,V]=svds(data,m) doesn't reduce the size of the matrix? Garrith Graham Oct 16 '11 at 22:45


Um... did you read the feedback from your earlier post? It's suggested there you cut your teeth first on something easier, as the data you are working with are from a very hard problem. Nevertheless, if you want to reduce the dimensionality of the data (and you probably do, since such high-dimensional data has properties that impact on these kinds of approaches) then a good survey of preprocessing schemes you could initially try can be found at apphire/pubs/148494.pdf Bob Durrant Oct 17 '11 at 5:42
