
Hybrid PCA-ILGC Clustering Approach for High Dimensional Data

Aina Musdholifah1,2, Siti Zaiton Mohd Hashim1 and Razali Ngah3

1 Department of Software Engineering, Universiti Teknologi Malaysia (UTM), Johor, Malaysia

2 Department of Computer Science and Electronics, Universitas Gadjah Mada (UGM), Yogyakarta, Indonesia

3 Wireless Communication Centre, Faculty of Electrical Engineering, Universiti Teknologi Malaysia (UTM), Johor, Malaysia

[email protected]; [email protected]; [email protected]

Abstract— The incredible growth in the availability of high dimensional datasets has made conventional approaches insufficient for extracting hidden useful information. As a result, researchers today are challenged to develop new techniques to deal with massive high dimensional data that grows not only in the number of records but also in the number of attributes. To improve the effectiveness and accuracy of mining tasks on high dimensional data, an efficient dimensionality reduction method should be executed in the data preprocessing stage, before a clustering technique is applied. Many clustering algorithms have been proposed and used to discover useful information from datasets. Iterative Local Gaussian Clustering (ILGC) is a simple density-based clustering technique that has successfully discovered the number of clusters represented in a dataset. In this paper we propose to use the Principal Component Analysis (PCA) method to preprocess the data prior to ILGC clustering, in order to simplify the analysis and visualization of multi-dimensional data sets. The proposed approach is validated with benchmark classification datasets, and the performance of the proposed hybrid PCA-ILGC clustering approach is compared with the original ILGC, basic k-means and hybridized k-means. The experimental results indicate that the proposed approach obtains clusters with higher accuracy while decreasing the time taken to process the data.

Keywords: Clustering; iterative local Gaussian clustering algorithm; dimensionality reduction; principal component analysis.

I. INTRODUCTION

In recent years, the availability of high dimensional data has been growing rapidly. The data collected in the world's databases is increasing not only in the amount of data but also in the number of attributes, although it is difficult to argue that this growth is qualitative rather than merely quantitative. Moreover, in a highly competitive, customer-centered and service-oriented economy, data is the raw material that fuels business development. Analyzing the data in these databases intelligently is therefore necessary in order to discover valuable information.

Clustering is a powerful exploratory technique for extracting valuable information from data [1]. The clustering task groups unlabelled large data sets according to their similarity. For clustering high dimensional data, several techniques have been introduced by various authors and researchers, such as partitional clustering approaches [2], grid-based clustering techniques, and density-based methods [3, 4]. However, developing an efficient clustering technique remains challenging, because the size and complexity of high dimensional data expose the limitations and drawbacks of previous work.

The difficulty of handling high dimensional data can be addressed with dimensionality reduction techniques such as feature reduction, which projects the original high-dimensional data to a lower-dimensional space through an algebraic transformation. Principal component analysis (PCA) is a commonly used feature reduction method in terms of minimizing reconstruction error. For example, the hybridized k-means proposed by [5] utilizes PCA to project the dataset into a lower-dimensional space before basic k-means clustering [6] is executed. The dimension is thereby reduced, but choosing the number of clusters remains a problem.

Kernel nearest neighbor based clustering methods, including ILGC [7], determine the number of clusters automatically. ILGC has also been applied to clustering spatio-temporal data, which is high dimensional [8]. However, it can run slowly on data with many attributes. Hence, a new approach for clustering high-dimensional data is required, especially one that handles the problems of the clustering algorithms mentioned above. Thus, this paper proposes a hybrid PCA-ILGC clustering approach that utilizes the PCA method to preprocess the data prior to ILGC clustering, in order to simplify the analysis and visualization of multi-dimensional data sets.

Section 2 describes the previous related work. It is followed by a detailed description of the proposed approach for clustering high dimensional data in Section 3. The experimental results and a discussion of the algorithms' performance are presented in Section 4. The conclusion is drawn in Section 5.

2012 IEEE International Conference on Systems, Man, and Cybernetics, October 14-17, 2012, COEX, Seoul, Korea

978-1-4673-1714-6/12/$31.00 ©2012 IEEE

II. RELATED WORKS

A. Iterative Local Gaussian Clustering (ILGC)

ILGC, proposed by [7], is a non-parametric density-based approach that determines the density of the data. In ILGC, KNN density estimation is extended and combined with a Gaussian kernel function, where KNN contributes by iteratively determining the best local data for Gaussian kernel density estimation. The best local data is defined as the set of neighboring data that maximizes the Gaussian kernel function. ILGC uses the Bayesian rule to deal with the problem of selecting this "best" local data.

ILGC is simple and able to determine the number of clusters automatically, since it uses a nonparametric density estimate and the Bayes rule. Furthermore, ILGC can handle high dimensional data and requires only one parameter. A detailed description of ILGC is given in [8].

ILGC assigns each object x_i to exactly one cluster: the target data belongs to the cluster ω_i whose class-conditional density function at x_i is maximized.

Below is a summary of the ILGC algorithm [7]:

1) Set the number of clusters to the N "informative data" selected.

2) Assign each data point x_i (i = 1..N), with its k neighbours, to the cluster ω given by:

   ω(x_i) = arg max_ω p̂(x_i | ω)    (1)

   where p̂(x_i | ω) is the class-conditional density function at x_i for each cluster ω [8].

3) If there are no changes in the cluster structure, the iterations have converged: re-index the clusters and stop. Otherwise go to step 4.

4) Re-calculate the cluster memberships using (1), then go to step 2.
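The loop above can be sketched in Python. This is an illustrative simplification rather than the authors' implementation: the bandwidth choice, the neighbor selection, and the initialization from singleton clusters are assumptions made here purely to obtain a runnable example.

```python
import numpy as np

def ilgc_sketch(X, k=3, max_iter=50):
    """Illustrative sketch of the ILGC loop: each point joins the cluster
    whose k nearest members give the highest Gaussian kernel density at
    that point. Bandwidth handling is deliberately simplified."""
    n = len(X)
    labels = np.arange(n)  # step 1: start from N singleton clusters
    # crude global bandwidth (an assumption, not part of the paper)
    h = np.median(np.linalg.norm(X - X.mean(axis=0), axis=1)) + 1e-9
    for _ in range(max_iter):
        new_labels = labels.copy()
        for i in range(n):
            best_c, best_dens = labels[i], -np.inf
            for c in np.unique(labels):
                idx = np.where(labels == c)[0]
                idx = idx[idx != i]            # exclude the point itself
                if len(idx) == 0:
                    continue
                d2 = np.sum((X[idx] - X[i]) ** 2, axis=1)
                nearest = np.sort(d2)[:k]      # "best local data" of cluster c
                dens = np.mean(np.exp(-nearest / (2 * h * h)))
                if dens > best_dens:
                    best_c, best_dens = c, dens
            new_labels[i] = best_c             # step 2: assignment rule (1)
        if np.array_equal(new_labels, labels):
            break                              # step 3: structure unchanged, stop
        labels = new_labels                    # step 4: re-calculate memberships
    _, labels = np.unique(labels, return_inverse=True)  # re-index clusters
    return labels
```

On two well-separated groups of points, this sketch never assigns points from different groups to the same cluster, because the kernel density contributed by far-away cluster members is negligible.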

B. Principal Component Analysis (PCA)

Principal Component Analysis (PCA), an unsupervised data reduction method, was initially described by [9]. PCA provides a way to reduce a complex data set to a lower dimension, which can reveal the hidden, simplified structure that lies beneath it [10]. Thus, the main objective of PCA is to reduce the dimensionality of the data with minimal loss of information, by retaining as much as possible of the variation in the original data set.

In data mining, PCA can be used in the data preparation and visualization steps, since it is a simple method for extracting relevant information from confusing data sets. PCA has mostly been used for visualizing high dimensional data in an understandable view by reducing the dimension of the data [8, 11-13]. PCA is also utilized to reduce either the dimension or the size of the data [5] before a clustering or classification process.

There are two main approaches to PCA. The first is singular value decomposition (SVD) of the data matrix; the other is to find the eigenvalues and eigenvectors of the covariance matrix describing the data set. Both methods yield the same information, but take different routes from a computational standpoint.

In brief, PCA can be performed using these steps [14]:

1) Organize the data set as an m x n matrix, where m is the number of attributes (or measurements) and n is the number of data objects (or instances).

2) Subtract off the mean for each attribute of every object x_i.

3) Calculate the SVD, or the eigenvectors of the covariance matrix.
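The claim that both routes yield the same information can be checked directly. The following sketch, on arbitrary toy data, computes the principal directions once via SVD of the centered data matrix and once via eigendecomposition of the covariance matrix, and confirms they agree up to sign:

```python
import numpy as np

# Toy data: 15 instances (rows) with 4 attributes (columns). The numbers
# are arbitrary; this only demonstrates that the two routes agree.
rng = np.random.default_rng(42)
X = rng.normal(size=(15, 4))
Xc = X - X.mean(axis=0)              # step 2: subtract each attribute's mean

# Route 1: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vt.T                       # columns = principal directions

# Route 2: eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]    # eigh returns ascending order
pcs_eig = eigvecs[:, order]

# Same directions, up to an arbitrary sign flip per component
for j in range(4):
    assert np.isclose(abs(pcs_svd[:, j] @ pcs_eig[:, j]), 1.0, atol=1e-6)

# The variance carried by PC j equals S[j]**2 / (n - 1)
assert np.allclose(S**2 / (15 - 1), eigvals[order])
```

The sign ambiguity is expected: a principal direction and its negation span the same axis, and the two decompositions may pick either one.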

III. PROPOSED HYBRID PCA-ILGC APPROACH

To improve the efficiency of the original ILGC clustering algorithm, which often works quite slowly on high dimensional and large datasets, we propose to apply PCA to the original data set to reduce its dimensionality. The resulting reduced data set is then fed to the ILGC clustering algorithm to determine the precise number of clusters. Fig. 1 describes the flow of the proposed approach.

To prevent the data from being dominated by certain features, the hybrid PCA-ILGC approach uses a normalization process. The approach starts with Z-score data normalization. The objectives of the normalization process are to reduce the mean squared error of approximating the input data by centering the data, and to obtain unit variance by standardizing the variables (data scaling). Using the Z-score, an attribute value V of an attribute A is normalized to V' defined as:

   V' = (V − μ_A) / σ_A    (2)

where μ_A and σ_A are the mean and standard deviation of attribute A.

Next, the SVD method of PCA is applied to the normalized dataset to obtain the principal components (PCs). The number of PCs obtained equals the number of original variables. To remove the weaker components from this PC set, the variance of each PC and the mean of these variances are calculated; the PCs whose variances are less than the mean variance are then discarded.

As a result, a transformation matrix with the reduced set of PCs is formed, and this transformation matrix is applied to the normalized data set to produce the new reduced projected dataset. Lastly, the original ILGC algorithm is applied to this reduced projected dataset to find the clusters.
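The preprocessing chain described above can be sketched as follows. The function name `pca_reduce` and the use of sample (ddof=1) standard deviation are assumptions made for this sketch:

```python
import numpy as np

def pca_reduce(X):
    """Sketch of the preprocessing chain: Z-score normalize, take PCs via
    SVD, drop PCs whose variance falls below the mean PC variance, and
    project the normalized data onto the retained PCs."""
    # Z-score normalization, equation (2)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # PCs of the normalized data via SVD
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    var = S**2 / (len(X) - 1)        # variance carried by each PC
    keep = var > var.mean()          # discard components weaker than the mean
    W = Vt[keep].T                   # transformation matrix of retained PCs
    return Z @ W                     # reduced projected dataset

# On random data, some PCs always fall below the mean variance,
# so the projected data has fewer columns than the original.
X = np.random.default_rng(1).normal(size=(50, 8))
Xr = pca_reduce(X)
assert Xr.shape[0] == 50 and 1 <= Xr.shape[1] < 8
```

Because at least one PC variance is always strictly above the mean and at least one strictly below (unless all are equal), this criterion keeps between 1 and m−1 components and never returns an empty projection.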


Figure 1. Hybrid PCA-ILGC approach for clustering high-dimensional data

Figure 2. Plotting of original synthetic data along with normalized synthetic data

IV. EXPERIMENTS

This section presents the experiments conducted. All experimental results were obtained on a desktop computer with 3.25 GB of RAM, an Intel Xeon E5420 2.5 GHz CPU, Windows XP Professional version 2002 Service Pack 3, and Matlab R2011a. Initially, the proposed approach is evaluated on a synthetic dataset created by [5], with 15 data objects having 10 attributes, as shown in Table I. Then three benchmark classification datasets, the Pima Indian Diabetes, Breast Cancer and SPECTF Heart data sets taken from the UCI machine learning repository [15], are used for testing the accuracy and efficiency of the hybrid PCA-ILGC algorithm. The Sum of Squared Error (SSE), representing the distances between data points and their cluster centers, is used to measure the clustering quality: between two solutions for a given dataset, the one with the smaller SSE and higher accuracy is the better solution.
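The SSE quality measure can be computed as below. This is a generic sketch of the standard definition, not code from the paper:

```python
import numpy as np

def sse(X, labels):
    """Sum of Squared Error: total squared distance from each point to the
    centroid of its assigned cluster. Lower is better."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

# Four points forming two tight pairs: the natural 2-cluster split has a
# much smaller SSE than a split that groups distant points together.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = sse(X, np.array([0, 0, 1, 1]))   # 4.0
bad = sse(X, np.array([0, 1, 0, 1]))    # 100.0
assert good < bad
```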

A. Step 1: Normalizing the original data set

The original dataset values are scaled so as to fall within a small specified range using the Z-score normalization process. Fig. 2 shows the structure of the original and normalized synthetic datasets. Compared to the original synthetic dataset, the normalized synthetic dataset is more compact and centered.

B. Step 2: Calculating the PCs using Singular Value Decomposition of the normalized data matrix

Applying PCA to the result of Step 1 yields as many PCs as there are original variables. To remove the weaker components from this PC set, the variance of each PC and the mean of these variances are calculated, and the PCs whose variances are less than the mean variance are discarded. The reduced PCs are shown in Table II.

C. Step 3: Finding the reduced data set using the reduced PCs

From Step 2, the transformation matrix with the reduced PCs is formed and applied to the normalized data set to produce the new reduced projected dataset, which is subsequently used for cluster analysis. The reduced data set is shown in Table III. We have also applied PCA to the three benchmark datasets. The characteristics of all datasets, including the reduced number of attributes obtained, are shown in Table IV.

D. Step 4: Comparison of efficiency and accuracy of the original ILGC [7], basic k-means clustering [6], hybridized k-means clustering [5], and proposed algorithm.

The clustering results shown in Figs. 3 and 4, obtained by applying the original ILGC to the original synthetic dataset and the proposed approach to the reduced dataset, are approximately the same, but the time taken for clustering is reduced due to the smaller number of attributes. The complete time analysis is given in Table V.

[Figure 1 flowchart: Data collection → Normalize the dataset using the Z-score method → Apply the SVD method of PCA to get PCs → Eliminate the unnecessary PCs → Find the reduced projected dataset using the reduced PCs → Get the cluster indexes of each object of the reduced dataset by ILGC]

TABLE I. THE ORIGINAL DATASET WITH 15 DATA OBJECTS AND 10 ATTRIBUTE VALUES

         a1  a2  a3  a4  a5  a6  a7  a8  a9  a10
Data1     1   5   1   1   1   2   1   3   1    1
Data2     2   5   4   4   5   7  10   3   2    1
Data3     3   3   1   1   1   2   2   3   1    1
Data4     4   6   8   8   1   3   4   3   7    1
Data5     5   4   1   1   3   2   1   3   1    1
Data6     6   8  10  10   8   7  10   9   7    1
Data7     7   1   1   1   1   2  10   3   1    1
Data8     8   2   1   2   1   2   1   3   1    1
Data9     9   2   1   1   1   2   1   1   1    5
Data10   10   4   2   1   1   2   1   2   1    1
Data11   11   1   1   1   1   1   1   3   1    1
Data12   12   2   1   1   1   2   1   2   1    1
Data13   13   2   1   1   1   2   1   2   1    1
Data14   14   5   3   3   3   2   3   4   4    1
Data15   15   1   1   1   1   2   3   3   1    1

TABLE II. THE REDUCED PCS HAVING VARIANCE GREATER THAN MEAN OF VARIANCE OF PCS

      PC1        PC2        PC3
 0.161104  -0.70692    0.222666
-0.34575    0.06143    0.104589
-0.37811   -0.12852    0.225285
-0.37785   -0.12565    0.21828
-0.34923    0.061232  -0.09679
-0.34712    0.275694  -0.1149
-0.28668    0.21192   -0.29648
-0.34062   -0.24545   -0.14089
-0.34594   -0.26379    0.318984
 0.091787   0.457921   0.780392

TABLE III. THE REDUCED DATASET CONTAINING 3 ATTRIBUTES

            a'1        a'2        a'3
Data1   0.223037   1.054463   0.579672
Data2  -1.14725    1.930422   1.181514
Data3   0.356619   0.736366   0.664884
Data4  -1.15879   -0.43648   -1.34541
Data5   0.208821   0.448114   0.524024
Data6  -2.98732   -0.52932   -0.26688
Data7   0.287241   0.517414   1.243295
Data8   0.477612  -0.19456    0.296029
Data9   0.855872   1.759055  -2.94325
Data10  0.450619  -0.31525    0.00696
Data11  0.726638  -0.81182    0.207616
Data12  0.673061  -0.64718    0.088997
Data13  0.688024  -0.80655    0.037868
Data14 -0.29401   -1.52922   -0.51413
Data15  0.639827  -1.17544    0.238801

TABLE IV. CHARACTERISTICS OF BENCHMARK CLASSIFICATION DATASET [15]

Dataset               # Instances  # Original attributes  # Reduced attributes
Synthetic                      15                     10                     3
Pima Indian Diabetes           50                      8                     3
Breast Cancer                  80                      9                     2
SPECTF Heart                   40                     44                     9

Figure 3. Clustering result with original dataset by original ILGC algorithm

Figure 4. Clustering result with reduced dataset by proposed ILGC algorithm

We also compared the clustering results obtained by the original ILGC algorithm and by the proposed algorithm over the 4 datasets, with original and with reduced dimensions, based on the sum of squared error (SSE). The SSE values obtained for the 4 datasets with the original ILGC, basic k-means, hybridized k-means, and the proposed algorithm are given in Table VI. The experimental results show that the proposed approach provides better SSE values in all cases; in this regard it increases the efficiency of the original ILGC algorithm.

The accuracy of clustering, another measure of clustering performance, is analyzed for the three UCI datasets. Fig. 5 presents the accuracy of the clusters obtained in the experiments using the four algorithms. In all cases the proposed algorithm provides better accuracy than the original ILGC [7], basic k-means [6] and the hybridized k-means algorithm [5].


TABLE V. COMPARISON OF TIME TAKEN (IN MS) FOR CLUSTERING ON SYNTHETIC, PIMA INDIAN DIABETES, BREAST CANCER, AND SPECTF HEART DATASETS

Dataset               Original k-means  Hybridized k-means  Original ILGC  Hybrid PCA-ILGC
Synthetic                          170                 150            120               63
Pima Indian Diabetes               158                 122            116               85
Breast Cancer                      167                 131            107               97
SPECTF Heart                       145                 112            103               89

TABLE VI. COMPARISON OF SSE FOR CLUSTERING ON SYNTHETIC, PIMA INDIAN DIABETES, BREAST CANCER, AND SPECTF HEART DATASETS

Dataset               Original k-means  Hybridized k-means  Original ILGC  Hybrid PCA-ILGC
Synthetic                       608.45               47.80         599.70            41.37
Pima Indian Diabetes         596353.90              177.93      501162.70           153.80
Breast Cancer                  3181.40              148.78        2758.80            80.16
SPECTF Heart                  97094.00             1054.67       91234.76          1000.48

Figure 5. Clustering accuracy of all dataset

V. CONCLUSIONS

In this paper a hybrid PCA-ILGC clustering approach has been proposed, which combines ILGC with dimensionality reduction through PCA. It is intended to tackle the issue of determining the number of clusters automatically while dealing with large, high-dimensional data. Using the proposed algorithm, a given data set is partitioned into k clusters in such a way that the sum of the total clustering errors over all clusters is reduced. The experimental results show that the proposed approach provides better efficiency and accuracy (close to 100% on some datasets) than the basic ILGC algorithm, original k-means and hybridized k-means, with reduced running time.

ACKNOWLEDGMENT

This work is supported by Universiti Teknologi Malaysia (UTM) and the Ministry of Higher Education (MOHE) Malaysia, under VOT number QJ.130000.2528.01H12. The authors gratefully acknowledge the many helpful comments by the reviewers and members of SCRG in improving this publication.

REFERENCES

[1] Musdholifah, A. and S.Z. Mohd Hashim, Triangular kernel nearest neighbor based clustering for pattern extraction in spatio-temporal database. 2010.

[2] Steinbach, M., P. Tan, and V. Kumar, Clustering earth science data: Goals, issues and results, in The Fourth KDD Workshop on Mining Scientific Datasets. 2001, San Francisco.

[3] Birant, D. and A. Kut, ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data and Knowledge Engineering, 2007. 60(1): p. 208-221.

[4] Yin, J., et al., High-dimensional shared nearest neighbor clustering algorithm. 2005, Changsha.

[5] Dash, R., et al., A hybridized K-means clustering approach for high dimensional dataset. International Journal of Engineering Science and Technology, 2010. 2(2): p. 59-66.

[6] Hartigan, J.A. and M.A. Wong, Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 1979. 28(1): p. 100-108.

[7] Wasito, I., S.Z.M. Hashim, and S. Sukmaningrum, Iterative Local Gaussian Clustering for Expressed Genes Identification Linked to Malignancy of Human Colorectal Carcinoma. Bioinformation, 2007. 2(5): p. 175-181.

[8] Musdholifah, A., S.Z.B. Mohd Hashim, and I. Wasito, KNN-kernel based clustering for spatio-temporal database. 2010.

[9] Pearson, K., LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 6, 1901. 2(11): p. 559-572.

[10] Martinez, W.L., A.R. Martinez, and J.L. Solka, Exploratory Data Analysis with MATLAB. 2010: CRC Press.

[11] Xu, H. and A. Ma'ayan, Visualization of Patient Samples by Dimensionality Reduction of Genome-Wide Measurements, in Information Quality in e-Health, A. Holzinger and K.-M. Simonic, Editors. 2011, Springer Berlin / Heidelberg. p. 15-22.

[12] Gorban, A.N. and A.Y. Zinovyev, PCA and K-Means Decipher Genome, in Principal Manifolds for Data Visualization and Dimension Reduction, A.N. Gorban, et al., Editors. 2008, Springer Berlin Heidelberg. p. 309-323.

[13] Bhattacharyya, R., et al., Classification of black tea liquor using cyclic voltammetry. Journal of Food Engineering, 2012. 109(1): p. 120-126.

[14] Shlens, J., A tutorial on principal component analysis. 2005, Institute for Nonlinear Science, University of California, San Diego.

[15] UCI Repository of machine learning databases. 2010, University of California, Irvine, School of Information and Computer Sciences.

[Figure 5 data: clustering accuracy per dataset and algorithm]

Dataset               Original K-means  Hybridized K-means  Original ILGC  Hybrid PCA-ILGC
Pima Indian Diabetes               52%                 64%            52%              66%
Breast Cancer                   80.00%              98.75%         81.25%           98.75%
SPECTF Heart                       85%               92.5%            90%              95%
