
Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xian, 15-17 July, 2012

A THREE-DIMENSIONAL DISPLAY FOR BIG DATA SETS

CHENG-LONG MA, XU-FENG SHANG, YU-BO YUAN*

Institute of Metrology and Computational Science, China Jiliang University, Hangzhou 310018, Zhejiang Province, China.

E-MAIL: [email protected]

Abstract:

Faced with high-dimensional information in science, technology and commerce, users need effective visualization tools to find useful information. For big data sets, extracting useful information is very difficult because the dimension is too large for a practical solution. This paper proposes a 3-D visualization method for big data sets. First, we employ the K-means clustering method to obtain the basis vectors. Then, we use these vectors to construct the reduction mapping. Finally, we obtain the three-dimensional display of each sample point. To verify the feasibility of this method, we perform experiments on some well-known databases such as Iris, Wine, and a large data set, Pendigits. The results are favorable. From the 3-D display results, we can also extract information such as classification, outliers, and classification level when given the level standards.

Keywords:

Big data mining; High-dimensional data; 3-D visualization; K-means; Inner product; P-norm

1. Introduction

With the development of information technology, the diversification of E-commerce, and the large-scale application of data warehouses, an increasing volume of information is available in scientific and commercial areas. Moreover, the properties reflecting scientific results or business information are rarely limited to two or three dimensions, which leads to high dimensional data. Faced with large high dimensional data, and due to the inherent limitations of human cognitive abilities, we usually cannot acquire information directly[1]. So multidimensional visualization techniques are widely used as effective information abstraction tools for the analysis of high dimensional databases in knowledge discovery, information

* Corresponding author: [email protected]

978-1-4673-1487-9/12/$31.00 ©2012 IEEE

awareness, and the decision making process[2]. By displaying the distribution of high dimensional data in a low dimensional visual space, researchers can quickly become aware of information such as features, relationships, clusters and trends. Targeted visualization can reflect the nature of the data effectively.

The usual practice is to seek a projection from the high dimensional space to a low dimensional space. It is not a simple graphical map. The goal is to find the relationship between data and their properties and to show the multi-attribute characteristics of multi-dimensional abstract information in a low dimensional visualization space[3]. Actually, it is the process of building the imagery of multi-dimensional abstract information in the mind, and then cognizing it. In terms of visualization, 2-D and 3-D are both desirable; the choice depends on the structure of the data and the purpose of the visualization.

For two or three dimensional display, techniques already exist. Fukunaga and Olsen [4] proposed a 2-D space visualization technique. They use a nonlinear mapping from multi-dimensional space data to two-dimensional coordinates. There are also nonlinear mappings proposed by Ball [5], Sammon [6], and Shepard and Carroll [7]. However, these nonlinear mappings may lead to a deformation of the data. Shepard and Carroll [7] tried to minimize the distances between data of different classes to keep the deformation within a tolerable range. In addition, dimensionality reduction can also be used for visualization, with the dimension reduced to the visual range. Lin et al. [8] proposed a kernel learning dimensionality reduction technique, which can be used for 2-D or 3-D.

In this paper, our main purpose is to seek a method to project multi-dimensional data to 3-D space to visualize the information of the original data. Of course, the distances between classes need to be minimized. For projection, the most intuitive and simplest method is to obtain the distances to the three coordinate axes. Thus, data in the original space correspond to points in 3-D space for 3-D visualization. The problem then turns into seeking basis vectors in 3-D space. According to different purposes, we can choose different optimization methods. For example, if the data have clustering features, then it is reasonable to expect that the visualized data also have these features. This paper uses the K-means clustering method.

This paper is organized as follows: Section 2 gives the 3-D coordinate projection method. Then, in Section 3, we seek the three basis vectors using the K-means clustering method. To show the practicality of this method, we present our experimental results on real world databases in Section 4. Finally, we present the conclusions in Section 5.

2. Visualization

For high dimensional data processing, the common method is dimensionality reduction, which performs a transformation on the original data while keeping the deformation within a tolerable range. If we can find a map which projects high dimensional data to a low dimensional space, then we can recognize potential characteristics in the data that are meaningful for data users. High dimensional data is invisible, so we need to present it to users intuitively and comprehensively. The mapping is given as follows:

d_i(X) = φ(X, M_i),   i = 1, 2, 3                    (1)

where X is a high dimensional data set, M_i is the basis point to be found, and d_i(X) is the projection from X to M_i, that is, the coordinates in 3-D space. φ can be either linear or nonlinear.


Figure 1. Graph of 3D Visualization

In Euclidean space, we can use the inner product to achieve this. An inner product similarity is needed to define the projection. Certainly, distance is also a similarity measurement; it is the L2-norm when using vectors to represent the data. When the three basis points are found, the coordinate values of the data in 3-D space are the distances to these basis points. Then, visualization is achieved. The formulas are given as follows:

d_i(X) = ||X - M_i||,   i = 1, 2, 3   (norm measure)
d_i(X) = <X, M_i>,      i = 1, 2, 3   (inner product measure)

We can also choose the L1-norm, a general p-norm, or the infinity norm. For an n-dimensional data point z = (z_1, z_2, ..., z_n) and any p > 0, the norm is defined as follows:

||z||_p = (|z_1|^p + |z_2|^p + ... + |z_n|^p)^(1/p)                    (2)

Different norms have different geometric characteristics [10]. In 3-D visualization, the unit ball of the L1-norm is a polyhedron, that of the L2-norm is a sphere, and that of a general p-norm is a figure between a unit sphere and a unit cube. The unit ball of the infinity norm is a cube. Other measurements have other geometric characteristics.
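As a concrete illustration of Eq. (2), the following sketch (our own code, not from the original; NumPy assumed) computes p-norms of a small vector:

```python
import numpy as np

def p_norm(z, p):
    """p-norm of Eq. (2): (|z_1|^p + ... + |z_n|^p)^(1/p)."""
    z = np.asarray(z, dtype=float)
    return float(np.sum(np.abs(z) ** p) ** (1.0 / p))

z = [3.0, 4.0]
print(p_norm(z, 1))              # L1-norm: 7.0
print(p_norm(z, 2))              # L2-norm: 5.0
print(max(abs(v) for v in z))    # infinity norm (limit p -> inf): 4.0
```

As p grows, p_norm(z, p) approaches the infinity norm, matching the cube-shaped unit ball described above.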

3. Basis Vectors

3.1. Determination of the Basis Vectors Based on K-means

In our proposed method, the only thing we need to determine is the three basis points. In [4], it was proposed that the basis point problem can be solved by an optimization method. Here, we use a clustering method: the K-means method[9] is used to find the clustering centers, which serve as the basis points.

The K-means algorithm was proposed by J. B. MacQueen [11]. It is a typical distance-based iterative algorithm that uses distance to evaluate similarity: the smaller the distance between two objects, the higher their similarity. The algorithm assumes that a cluster consists of objects that are close to each other in terms of distance, so the goal is to obtain compact and separable clusters.

The basic idea of the algorithm is to use as an objective function the sum of distances from each data point to its prototype (the center of a class). The similarity measurement is the Euclidean distance. Starting from some initial clustering center vectors, finding the optimal classification means minimizing the objective function. The clustering criterion is the error sum of squares criterion function.

The algorithm uses an iterative updating method. In every update, points are separated into k clusters according to the k cluster centers; then the centroid (the average of all the data in a cluster, i.e., its geometric center) of every cluster is recalculated as the next reference point. The iteration moves the reference points closer to the real centroids, and a lower objective function value means a better clustering effect.
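The iterative update just described can be sketched as follows (a minimal illustration in our own code, not the authors' implementation; for real use a library routine such as scikit-learn's KMeans would be preferable):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```

On two well-separated point clouds, the returned centers converge to the clouds' geometric centroids, which is exactly the property the visualization relies on.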


3.2. K3DV Algorithm

In summary, the K-means based 3-D visualization algorithm is:

Algorithm: K3DV

Step 1: Data preprocessing.
Step 2: Seek the basis points for visualization using K-means.
Step 3: Solve (1) to obtain the 3-D coordinates.
Step 4: Output.
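Under our reading of the paper, Steps 2-3 can be sketched as follows (function names and structure are ours; the basis points would come from any K-means routine returning three cluster centers):

```python
import numpy as np

def k3dv(X, centers, measure="norm"):
    """Map each sample to 3-D: one coordinate per basis point M_i, as in Eq. (1)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    assert centers.shape[0] == 3, "K3DV uses exactly three basis points"
    if measure == "norm":
        # d_i(X) = ||X - M_i||  (L2-norm measure)
        return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # d_i(X) = <X, M_i>  (inner-product measure)
    return X @ centers.T

# hypothetical usage: rows of `coords` are the 3-D display coordinates
X = np.array([[0.0, 0.0], [3.0, 4.0]])
M = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]])  # stand-ins for K-means centers
coords = k3dv(X, M)
print(coords)  # [[0. 5. 4.] [5. 0. 3.]]
```

Each original sample, whatever its dimension, is thus reduced to three numbers that can be plotted directly.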

4. Numerical experiments

In order to verify that the above approach is feasible, we perform the following experiments. The databases used in this article are from the UCI repository[12], including the Iris, Wine and Pendigits databases.

4.1. Iris

The Iris database[12] contains 150 samples from three classes: Setosa, Versicolour and Virginica. The database has four attributes reflecting sample information: sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW), all measured in cm. Therefore, during the visualization process, we do not need to standardize the original database. Owing to the random initialization of the clustering centers, the following results are obtained:

TABLE 1. Clustering Centers, Using K-means for Iris Database (Recognition rate (RR) = 0.9867)

              SL      SW      PL      PW
Setosa       5.0060  3.4280  1.4620  0.2460
Versicolour  5.9157  2.7647  4.2647  1.3333
Virginica    6.6224  2.9837  5.5735  2.0327

According to the clustering centers listed, we adopt different mappings to visualize the clustered Iris database. In Figure 2, we can see that using the inner product separates the data well; the classification level is very obvious. This is a satisfactory result. Moreover, we show the result using the L2-norm in Figure 3. Figure 3 shows that the classification features of the original database are always kept. Both the inner product and the norm can distinguish Setosa. Versicolour and Virginica overlap slightly, but this is hard to see in the figures because of the density of the point sets. So in a particular display, we can use local magnification to distinguish them.


Figure 2. Iris 3-D visualization result, using inner-product


Figure 3. Iris 3-D visualization result, using L2-norm

4.2. Wine

The Wine database[12] is a classification database. It contains 178 data samples. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The 13 attributes are Alcohol (ALC), Malic acid (MAA), Ash, Alcalinity of ash (ALA), Magnesium (MAG), Total phenols (TPH), Flavanoids (FLA), Nonflavanoid phenols (NFP), Proanthocyanins (PRT), Color intensity (COI), Hue, OD280/OD315 of diluted wines (ODD) and Proline (PRO). However, these attributes are in different units. Therefore, during the visualization process, we need to standardize the original database. We standardize the data as follows:

y_ij = (x_ij - min(x_j)) / (max(x_j) - min(x_j))                    (3)

where y_ij is the standardized value of x_ij and x_j stands for the j-th column of data. After standardizing the data, we perform the visualization process. After many attempts, we obtain the following results:
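Assuming Eq. (3) denotes column-wise min-max scaling (our reading, consistent with the values in [0, 1] in Table 2), the standardization can be sketched as:

```python
import numpy as np

def min_max_standardize(X):
    """Scale each column to [0, 1]: y_ij = (x_ij - min x_j) / (max x_j - min x_j)."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

# hypothetical two-attribute sample: each column is scaled independently
X = np.array([[12.0, 100.0], [13.0, 300.0], [14.0, 200.0]])
print(min_max_standardize(X))  # columns scaled to [0, 1]
```

This removes the effect of the different units across the 13 attributes before clustering.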

Table 2: Clustering Centers, Using K-means for Wine Database (RR = 0.9867)

          ALC     MAA     ASH     ALA
Class 1  0.7057  0.2484  0.5849  0.3444
Class 2  0.3134  0.2356  0.4730  0.5002
Class 3  0.5467  0.4844  0.5616  0.5387

          MAG     TPH     FLA     NFP     PRT
Class 1  0.4107  0.6421  0.5547  0.3003  0.4773
Class 2  0.2455  0.4481  0.3801  0.4187  0.3972
Class 3  0.3152  0.2467  0.1047  0.6143  0.2254

          COI     HUE     ODD     PRO
Class 1  0.3553  0.4778  0.6904  0.5939
Class 2  0.1478  0.4722  0.5842  0.1564
Class 3  0.4888  0.1889  0.1585  0.2491

Figure 4 shows the result of using the inner product for the visualization process. From the figure, we can classify the results well. Figure 5 shows an example using the L2-norm display. If we have a quality standard, then we can see that the blue class is partially similar to the red class, but obviously different from the green class. In addition, the green class has outliers, which may be caused by sampling error or inferior wine.

Figure 4. Wine 3-D visualization result, using L2-norm

4.3. Big Data Sets: Pendigits

This is the pen-based recognition of handwritten digits data set[12]. It can be used for classification. It consists of 10992 data points and 16 attributes. The L2-norm based visualization is given in Figure 6. It shows the visual effect of the proposed method. We can see that the database has high overlap. If we want to divide the database into four types, the classification effect will be very obvious.

5. Conclusions and Outlooks

In this paper, we proposed a 3-D visualization method based on K-means. Using K-means on the original databases, we obtained good clustering centers. Then, we designed a map that projects data in the original space onto the clustering centers. Data points are assigned specific weights for these three clustering directions, so we finally get the coordinate values in 3-D space, and visualization is achieved. To verify the feasibility, we made experiments on databases such as Iris and Wine. The results are favorable. Choosing a suitable mapping can keep data features invariant, such as classification features. According to the 3-D display results, we can also find information like classification, outliers, and classification level when given the level standards.

6. Acknowledgment

This research has been supported by the National Natural Science Foundation under Grants (Nos. 61001200, 61101239) and the Natural Science Foundation and Education Department of Zhejiang under Grant (No. Y6100010).



Figure 5. Wine 3-D visualization result, using inner-product

References

[1] Chen Chaomei, "Top 10 unsolved information visualization problems", IEEE Computer Graphics and Applications, pp. 12-16, July/August 2005.

[2] Jain A. K., Murty M. N. and Flynn P. J., "Data clustering: a review", ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[3] Santos S. D. and Brodlie K., "Gaining understanding of multivariate and multidimensional data through visualization", Computers & Graphics, vol. 28, no. 1, pp. 311-325, 2004.

[4] Keinosuke Fukunaga and D. R. Olsen, "A Two-Dimensional Display for the Classification of Multivariate Data", IEEE Transactions on Computers, pp. 917-923, August 1971.

[5] G. H. Ball, "Promenade-An improved interactive graphics man/machine system for pattern recognition", Stanford Research Inst., Menlo Park, Calif., Proj. 6737, Oct. 1968.

[6] J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis", IEEE Transactions on Computers, vol. C-18, pp. 401-409, May 1969.

[7] R. N. Shepard and J. D. Carroll, "Parametric representation of nonlinear data structures", in Proceedings International Symp. Multivariate Analysis, P. R. Krishnaiah, Ed. New York: Academic Press, 1966.

Figure 6. Pendigits 3-D visualization result, using L2-norm

[8] Yen-Yu Lin, Tyng-Luh Liu, and Chiou-Shann Fuh, "Multiple Kernel Learning for Dimensionality Reduction", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1147-1159, June 2011.

[9] Duda R. O. and Hart P. E., "Pattern classification and scene analysis", New York: John Wiley & Sons, 1973.

[10] Omer Egecioglu, Hakan Ferhatosmanoglu, and Umit Ogras, "Dimensionality reduction and similarity computation by inner-product approximations", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, pp. 714-726, June 2004.

[11] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations", in Proc. of the 5th Berkeley Symp. on Mathematical Statistics and Probability, pp. 281-297, University of California Press, 1967.

[12] Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J., 1998, UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
