
K-Means Clustering and Affinity Clustering based on Heterogeneous Transfer Learning

    Shailendra Kumar Shrivastava, Dr. J. L. Rana, and Dr. R.C. Jain

Abstract - Heterogeneous transfer learning aims to extract knowledge from one or more tasks in one feature space and apply this knowledge to a target task in another feature space. In this paper two clustering algorithms, K-Means clustering and affinity clustering, both based on heterogeneous transfer learning (HTL), are proposed. Both algorithms use annotated image data sets. K-Means based on HTL first finds the cluster centroids of the text (annotations) by K-Means; these text centroids are then used to initialize the centroids in image clustering by K-Means. The second algorithm, affinity clustering based on HTL, first finds the exemplars of the annotations, and these exemplars are then used to initialize the similarity matrix of the image data set before finding the clusters. F-Measure and Purity scores increase and Entropy scores decrease in both algorithms, and the clustering accuracy of affinity clustering based on HTL is better than that of K-Means based on HTL.

Key words - Heterogeneous transfer learning, clustering, affinity propagation, K-Means, feature space.

1 INTRODUCTION

In the literature [1], machine learning is defined as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

However, many machine learning methods work well only under the assumption that the training data and testing data are drawn from the same feature space. If the feature space differs between training and testing data, most statistical models will not work. In this case one needs to recollect training and testing data in the same feature space and rebuild the model, but this is expensive and difficult. In such cases transfer learning [3] between task domains is desirable. Transfer learning allows the domains, tasks, and distributions used in training and testing to be different. In heterogeneous transfer learning, knowledge is transferred across domains or tasks that have different feature spaces, e.g., classifying web pages in Chinese using training documents in English [4]. Probabilistic latent semantic analysis (PLSA) [5] has been used to cluster images with the help of their annotations (text). Transfer learning in machine learning [2] has already achieved significant success in many knowledge engineering areas, including classification, regression, and clustering.

Clustering is a fundamental task in computerized data analysis. It is concerned with the problem of partitioning a collection of data points into groups/categories using unsupervised learning techniques. Data points in a group are similar to one another; such groups are called clusters [6][7][8].

In this paper two algorithms, K-Means [8][9] based on heterogeneous transfer learning and affinity clustering based on heterogeneous transfer learning, are proposed. Affinity propagation (AP) [6] is a clustering algorithm that, for a given set of similarities (also denoted affinities) between pairs of data points, partitions the data by passing messages among the data points. Each partition is associated with a prototypical point (exemplar) that best describes that cluster, and AP associates each data point with one such prototype. Thus, the objective of AP is to maximize the overall sum of similarities between data points and their representatives. K-Means starts with a random initial partition and keeps reassigning patterns to clusters, based on the similarity between each pattern and the cluster centroids, until a convergence criterion is met.

An annotated image data set has two feature spaces: the first is the text feature space, the other is the image feature space. To transfer knowledge of the text feature space into the image feature space, we first find the centroids of the annotations by K-Means. Corresponding to the text (annotation) centroids, image centroids become available. Next we take the complete image data set, assign each image to a centroid on the basis of minimum Euclidean distance, and finally apply K-Means to generate the image clusters. In affinity clustering based on HTL we use the text annotations of the images to find exemplars by affinity propagation clustering. To transfer the knowledge from the text feature space to the image feature space, we initialize the diagonal of the image similarity matrix using the exemplars of the text clustering and then generate the image clusters from the image similarity matrix by affinity propagation clustering.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of transfer learning, the K-Means clustering algorithm, the original affinity propagation algorithm, and the vector space model. Section 3 describes the main idea and details of our proposed algorithms. Section 4 discusses the experimental results and evaluations. Section 5 provides concluding remarks and future directions.

Shailendra Kumar Shrivastava is with the Department of Information Technology, Samrat Ashok Technological Institute, Vidisha, M.P. 464001, India.

Dr. J. L. Rana, Ex-Head of the Department of Computer Science & Engineering, was with M.A.N.I.T., Bhopal, India.

Dr. R. C. Jain, Director, is with the Samrat Ashok Technological Institute, Vidisha, M.P. 464001, India.





    2 RELATED WORKS

Before going into the details of the proposed K-Means based on heterogeneous transfer learning and affinity clustering based on heterogeneous transfer learning algorithms, some works closely related to this paper are briefly reviewed: transfer learning, the K-Means clustering algorithm, the affinity propagation algorithm, and the vector space model.

    2.1 Transfer Learning

Machine learning methods work well only under the common assumption that the training and test data come from the same feature space and the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real-world applications it is expensive or impossible to re-collect the needed training data and rebuild the models, so it would be desirable to reduce the need and effort to re-collect training data. In such cases knowledge transfer, or transfer learning [3], between task domains is desirable. Transfer learning has three main research issues: (1) what to transfer, (2) how to transfer, and (3) when to transfer. In the inductive transfer learning setting, the target task is different from the source task, no matter whether the source and target domains are the same or not. In the transductive transfer learning setting, the source and target tasks are the same, while the source and target domains are different. In the unsupervised transfer learning setting, similar to the inductive setting, the target task is different from but related to the source task. In heterogeneous transfer learning, knowledge is transferred across domains or tasks that have different feature spaces.

    2.2 K-Means Clustering Algorithm

K-Means [8][9] is one of the best known and most popular clustering algorithms. It seeks an optimal partition of the data by minimizing the sum-of-squared-error criterion with an iterative optimization procedure. The K-Means clustering procedure is as follows.

1. Initialize a K-partition randomly or based on some prior knowledge, and calculate the cluster prototype matrix $M = [m_1, \dots, m_K]$.

2. Assign each object in the data set to the nearest cluster $C_k$.

3. Recalculate the cluster prototype matrix based on the current partition:

$$m_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i \qquad (1)$$

where $N_k$ is the number of objects in cluster $C_k$.

4. Repeat steps 2 and 3 until there is no change in any cluster.

The major problem with this algorithm is that it is sensitive to the selection of the initial partition.
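As a concrete illustration, the procedure above can be sketched in a few lines of Python with NumPy. This is a minimal sketch of the standard algorithm, not the authors' implementation; the data matrix X, the random seed, and the iteration cap are assumptions of the example.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-Means sketch; X is an (n, d) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the prototype matrix M = [m_1, ..., m_K] with random points.
    M = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest prototype (squared Euclidean distance).
        labels = np.argmin(((X[:, None, :] - M[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 3: recalculate the prototypes from the current partition (equation 1).
        new_M = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else M[k]
                          for k in range(K)])
        # Step 4: stop when the prototypes no longer change.
        if np.allclose(new_M, M):
            break
        M = new_M
    return labels, M

# Example: 100 random 2-D points into K = 3 clusters.
labels, centroids = kmeans(np.random.rand(100, 2), K=3)
```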

    2.3 Affinity Clustering Algorithm

The affinity clustering algorithm [10][11][12] is based on message passing among data points. Each data point receives availability messages from candidate exemplars and sends responsibility messages to them; the sum of the responsibilities and availabilities for each data point identifies the exemplars. After the exemplars have been identified, the data points are assigned to them to form the clusters. The steps of the affinity clustering algorithm are as follows.

1. Initialize the availabilities to zero: $a(i,k) = 0$.

2. Update the responsibilities by the equation

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \{ a(i,k') + s(i,k') \} \qquad (2)$$

where $s(i,k)$ is the similarity of data point $i$ and candidate exemplar $k$.

3. Update the availabilities by the equation

$$a(i,k) \leftarrow \min \Big\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\} \Big\} \qquad (3)$$

and update the self-availabilities by the equation

$$a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\} \qquad (4)$$

4. For each data point $i$, compute the sum $a(i,k) + r(i,k)$ and find the value of $k$ that maximizes it to identify its exemplar.

5. If the exemplars do not change for a fixed number of iterations, go to step 6; otherwise go to step 2.

6. Assign the data points to the exemplars on the basis of maximum similarity to find the clusters.
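The updates (2)-(4) vectorize naturally. The sketch below is a bare-bones version of affinity propagation with the usual damping factor; the similarity matrix S (with the preferences already on its diagonal), the damping value, and the iteration cap are assumptions of the example, not choices made in the paper.

```python
import numpy as np

def affinity_propagation(S, max_iter=200, damping=0.5):
    """Bare-bones affinity propagation; S is an (n, n) similarity matrix
    whose diagonal already holds the preferences."""
    n = S.shape[0]
    A = np.zeros((n, n))  # availabilities a(i, k), initialized to zero
    R = np.zeros((n, n))  # responsibilities r(i, k)
    for _ in range(max_iter):
        # Responsibility update (equation 2): subtract, for each point i, the
        # largest a(i,k') + s(i,k') over k' != k (first/second-max trick).
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        Rnew = S - first_max[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * Rnew
        # Availability update (equations 3 and 4).
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())            # keep r(k,k) itself
        Anew = Rp.sum(axis=0)[None, :] - Rp           # r(k,k) + sum of positive r(i',k)
        dA = Anew.diagonal().copy()                   # a(k,k): sum of positive r(i',k), i' != k
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
    # The exemplar of point i is the k maximizing a(i,k) + r(i,k).
    return np.argmax(A + R, axis=1)
```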

    2.4 Vector Space Model

The vector space model (VSD) [13] is used to represent text documents. In the VSD, each document $d$ is considered as a vector in the $M$-dimensional term (word) space, and the tf-idf weighting scheme is used, so each document is represented by the following equation:

$$\vec{d} = [w(d,1), w(d,2), \dots, w(d,M)] \qquad (5)$$

where $M$ is the number of distinct terms (words), and

$$w(d,i) = (1 + \log tf(d,i)) \cdot \log(1 + N/df(i)) \qquad (6)$$

where $tf(d,i)$ is the frequency of the $i$-th term in document $d$ and $df(i)$ is the number of documents containing the $i$-th term. The inverse document frequency (idf) is defined as the logarithm of the ratio of the number of documents $N$ to the number of documents containing the given word, $df(i)$.
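As an illustration of equations (5) and (6), the sketch below builds tf-idf document vectors. It works at the level of single words for brevity, whereas the algorithms in Section 3 use phrases mined from a suffix tree; the toy documents are assumptions of the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors per equations (5) and (6).
    docs: list of token lists; returns one {term: weight} dict per document."""
    N = len(docs)   # number of documents
    df = Counter()  # df(i): number of documents containing term i
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf(d, i): frequency of term i in this document
        # w(d, i) = (1 + log tf(d, i)) * log(1 + N / df(i))
        vectors.append({t: (1 + math.log(c)) * math.log(1 + N / df[t])
                        for t, c in tf.items()})
    return vectors

# Example: three tiny "documents", already tokenized.
vecs = tfidf_vectors([["sky", "sea"], ["sky", "sun"], ["sun", "sand"]])
```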

3 CLUSTERING BASED ON HETEROGENEOUS TRANSFER LEARNING

In this section, two clustering algorithms based on heterogeneous transfer learning are proposed. The first is



K-Means clustering based on heterogeneous transfer learning, and the second is affinity propagation clustering based on heterogeneous transfer learning.

3.1 K-Means Clustering based on Heterogeneous Transfer Learning

K-Means clustering based on heterogeneous transfer learning extends K-Means. An annotated image data set is used in the simulation studies, from which both annotation features (text feature space) and image features (image feature space) are computed. K-Means clustering is applied to the text annotations of the images to find centroids; to transfer knowledge from one task to the other, the centroids in image clustering are then initialized from the centroids obtained in text clustering. For text clustering a phrase-based VSD [13] is used: in the vector space model the weights w(d, i), term frequencies, and document frequencies are normally calculated per word, but here phrases are used instead of words, so the model may be called a phrase-based vector space model. Phrase (term) frequencies and document frequencies can be calculated with a suffix tree; here the document frequency of a phrase is the number of documents that contain the phrase. The centroids of the annotations are generated by the K-Means algorithm with the VSD as input, and K-Means clustering is then applied to the image data set with the centroids initialized to those obtained in text clustering. The proposed K-Means clustering algorithm based on heterogeneous transfer learning can be written as follows (a code sketch follows the listing).

1. Input annotations (text) for clustering.
2. Text preprocessing: remove all stop words and perform word stemming.
3. Find the words and assign a unique number to each word.
4. Convert the text into sequences of numbers.
5. Construct a suffix tree using Ukkonen's algorithm.
6. Calculate the phrase (term) frequencies from the suffix tree.
7. Calculate the document frequencies of the phrases from the suffix tree.
8. Construct the vector space model of the text using phrases.
9. Apply K-Means to the VSD.
10. Initialize the centroids in the image domain from the centroids obtained in text clustering.
11. Apply K-Means to the image data set to find the clusters.
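A compact sketch of steps 9-11 is given below, reusing the kmeans function from the sketch in Section 2.2. The paper does not spell out how a text centroid becomes an image centroid, so the mapping used here, averaging the image vectors of the documents assigned to each text cluster, is an assumption of the sketch.

```python
import numpy as np

def kmeans_htl(text_vsd, image_feats, K):
    """K-Means based on HTL (sketch): cluster the annotations, then seed the
    image-space centroids from the resulting partition.
    text_vsd:    (n, t) phrase-based VSD matrix of the annotations
    image_feats: (n, d) feature matrix of the corresponding images
    """
    # Step 9: cluster the text side (kmeans is the function from the Section 2.2 sketch).
    text_labels, _ = kmeans(text_vsd, K)
    # Step 10 (assumed mapping): the seed centroid of each image cluster is the
    # mean image vector of the images whose annotations fell in that text cluster.
    seeds = np.array([image_feats[text_labels == k].mean(axis=0)
                      if np.any(text_labels == k) else image_feats.mean(axis=0)
                      for k in range(K)])
    # Step 11: run K-Means on the image features starting from these seeds
    # (nearest-seed assignment by Euclidean distance, then the usual iteration).
    labels = np.argmin(((image_feats[:, None, :] - seeds[None, :, :]) ** 2).sum(-1), axis=1)
    for _ in range(100):
        seeds = np.array([image_feats[labels == k].mean(axis=0)
                          if np.any(labels == k) else seeds[k] for k in range(K)])
        new = np.argmin(((image_feats[:, None, :] - seeds[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```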

3.2 Affinity Clustering based on Heterogeneous Transfer Learning

Affinity clustering based on heterogeneous transfer learning extends affinity propagation clustering. An annotated image data set is used, in which the annotations (text feature space) and the images (image feature space) form the starting point. Affinity clustering is applied to the text annotations of the images to find exemplars; to transfer knowledge from one task to the other, the diagonal values of the similarity matrix of the image data set are then assigned on the basis of the exemplars of the text clustering. For text clustering the same phrase-based VSD as in Section 3.1 is used, and this model is used to compute the cosine similarity [15]. The similarity of two documents $d_i$ and $d_j$, each represented by equation (5), is calculated by equation (7):

$$sim(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}|\,|\vec{d_j}|} \qquad (7)$$

The self-similarity (preference) [9] is found from equation (8), which sets a common value, the median of the pairwise similarities, for every document:

$$sim(d_k, d_k) = \operatorname*{median}_{i \neq j} \, sim(d_i, d_j), \quad 1 \le k \le N \qquad (8)$$

The affinity propagation clustering algorithm is applied to generate the exemplars. The features of the image data set are then extracted to make its feature vector space, and the similarity matrix is computed from the image vectors. The diagonal values of the image-domain similarity matrix are assigned on the basis of the exemplars of the text clustering, which transfers the knowledge from one domain to the other, and the exemplars/clusters are generated by the affinity propagation clustering algorithm. The proposed algorithm can be written as follows (a code sketch follows the listing).

1. Input annotations (text) for clustering.
2. Text preprocessing: remove all stop words and perform word stemming.
3. Find the words and assign a unique number to each word.
4. Convert the text into sequences of numbers.
5. Construct a suffix tree using Ukkonen's algorithm.
6. Calculate the phrase (term) frequencies from the suffix tree.
7. Calculate the document frequencies of the phrases from the suffix tree.
8. Construct the vector space model of the text using phrases.
9. Find the phrase-based similarity matrix of the documents from the vector space model by equation (7).
10. Assign the preferences in the similarity matrix by equation (8).
11. Initialize the availabilities to zero: a(i,k) = 0.
12. Update the responsibilities by equation (2).
13. Update the availabilities by equation (3).
14. Update the self-availabilities by equation (4).
15. For each data point i, compute the sum a(i,k) + r(i,k) and find the value of k that maximizes it to identify the exemplars.



16. If the exemplars do not change for a fixed number of iterations, go to step 17; otherwise go to step 12.
17. Extract feature vectors from the image data set.
18. Find the similarity matrix from the image feature vectors.
19. Transfer the knowledge from the text feature space to the image feature space by initializing the diagonal of the image similarity matrix from the text exemplars.
20. Initialize the availabilities to zero: a(i,k) = 0.
21. Update the responsibilities by equation (2).
22. Update the availabilities by equation (3).
23. Update the self-availabilities by equation (4).
24. For each data point i, compute the sum a(i,k) + r(i,k) and find the value of k that maximizes it to identify the exemplars.
25. If the exemplars do not change for a fixed number of iterations, go to step 26; otherwise go to step 21.
26. Assign the data points to the exemplars on the basis of maximum similarity to find the clusters.
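A sketch of the transfer step (steps 16-26) is given below, reusing the affinity_propagation function from the sketch in Section 2.3. The paper does not spell out the exact rule for seeding the diagonal, so the rule used here, a high preference for images whose annotations were chosen as text exemplars and the median similarity for the rest, is an assumption of the sketch, as is the use of negative squared Euclidean distance for image similarities.

```python
import numpy as np

def ap_htl(text_sim, image_feats):
    """Affinity clustering based on HTL (sketch).
    text_sim:    (n, n) phrase-based similarity matrix of the annotations,
                 with preferences already on its diagonal (equation 8)
    image_feats: (n, d) feature matrix of the corresponding images
    """
    # Steps 10-16: run AP on the text side to find the exemplar annotations
    # (affinity_propagation is the function from the Section 2.3 sketch).
    text_exemplars = affinity_propagation(text_sim)
    exemplar_idx = np.unique(text_exemplars)
    # Steps 17-18: image similarities as negative squared Euclidean distance.
    diff = image_feats[:, None, :] - image_feats[None, :, :]
    S = -(diff ** 2).sum(-1)
    # Step 19 (assumed rule): seed the diagonal from the text exemplars --
    # text-exemplar images get a high preference, the rest the median similarity.
    off_diag = S[~np.eye(len(S), dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag))
    S[exemplar_idx, exemplar_idx] = off_diag.max()
    # Steps 20-26: run AP on the seeded image similarity matrix.
    return affinity_propagation(S)
```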

4 EXPERIMENTAL RESULTS AND EVALUATION

In this section, the results and evaluation of a set of experiments are presented to verify the effectiveness and efficiency of the proposed clustering algorithms. The evaluation parameters are F-Measure, Purity, and Entropy. Experiments were performed on data sets constructed from the Caltech-256 corpus [14]. The evaluation parameters, the data sets, and the results are discussed in turn.

4.1 Evaluation Parameters [15]

For ready reference, the definitions and formulas of F-Measure, Purity, and Entropy are given below.

4.1.1 F-Measure

F-Measure combines precision and recall. Let $C = \{c_1, \dots, c_{|C|}\}$ be the clusters of a data set $D$ of $N$ documents, and let $L = \{l_1, \dots, l_{|L|}\}$ be the correct classes of $D$. The recall of cluster $j$ with respect to class $i$ is defined as

$$Recall(i,j) = \frac{|c_j \cap l_i|}{|l_i|}$$

and the precision of cluster $j$ with respect to class $i$ is defined as

$$Precision(i,j) = \frac{|c_j \cap l_i|}{|c_j|}$$

The F-Measure of cluster $j$ and class $i$ combines precision and recall as follows:

$$F(i,j) = \frac{2 \cdot Precision(i,j) \cdot Recall(i,j)}{Precision(i,j) + Recall(i,j)}$$

The F-Measure for the overall quality of the cluster set $C$ is defined by

$$F = \sum_{i} \frac{|l_i|}{N} \max_{j=1,\dots,|C|} F(i,j)$$

4.1.2 Purity

Purity indicates the percentage of dominant-class members in a given cluster. For measuring the overall clustering purity, the weighted average purity is used, given by the following equation:

$$Purity = \sum_{j} \frac{|c_j|}{N} \max_{i=1,\dots,|L|} \frac{|c_j \cap l_i|}{|c_j|}$$

4.1.3 Entropy

Entropy indicates the homogeneity of a cluster: the higher the homogeneity of a cluster, the lower its entropy should be, and vice versa. Like the weighted F-Measure and weighted Purity, a weighted entropy is used, given by the following equation:

$$Entropy = \sum_{j} \frac{|c_j|}{N} \Big( -\frac{1}{\log |L|} \sum_{i} p_{ij} \log p_{ij} \Big)$$

where $p_{ij}$ is the probability that a member of cluster $c_j$ belongs to class $l_i$.

To sum up, we would like to maximize the F-Measure and Purity scores and minimize the Entropy score of a clustering to achieve high quality.
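For reference, the three scores can be computed directly from the class/cluster contingency table. The sketch below is a straightforward reading of the formulas above; the integer label arrays labels_true and labels_pred are assumptions of the example.

```python
import numpy as np

def clustering_scores(labels_true, labels_pred):
    """Weighted F-Measure, Purity, and normalized Entropy of a clustering."""
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    N = len(labels_true)
    # n[i, j] = number of documents of class i that landed in cluster j
    n = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                   for k in clusters] for c in classes], dtype=float)
    class_sz, cluster_sz = n.sum(axis=1), n.sum(axis=0)
    recall = n / class_sz[:, None]        # Recall(i, j)
    precision = n / cluster_sz[None, :]   # Precision(i, j)
    with np.errstate(divide="ignore", invalid="ignore"):
        f = np.nan_to_num(2 * precision * recall / (precision + recall))
    f_measure = np.sum(class_sz / N * f.max(axis=1))  # class-size-weighted max F(i, j)
    purity = np.sum(n.max(axis=0)) / N                # dominant class per cluster
    p = n / cluster_sz[None, :]                       # p_ij within each cluster
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1)), 0.0)
    entropy = np.sum(cluster_sz / N * (-plogp.sum(axis=0) / np.log(len(classes))))
    return f_measure, purity, entropy

# Example: 3 true classes against 3 predicted clusters.
print(clustering_scores(np.array([0, 0, 1, 1, 2, 2]), np.array([0, 0, 1, 2, 2, 2])))
```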

4.2 Data Set Preparation

Image data sets of 100, 300, 500, and 800 images have been constructed, with the images randomly chosen from Caltech-256. Manually annotated text files were created for each data set.

4.3 Experimental Results Discussion

Extensive experiments were carried out to show the effectiveness of the proposed algorithms, combining annotations and images as follows: no annotations for each image-set size; 100 annotations with 100, 300, 500, and 800 images; 300 annotations with 300, 500, and 800 images; and 500 annotations with 500 and 800 images. The results of the experiments are given in Table 1, Table 2, and Table 3. It can be observed from Figs. 1-6 that in both algorithms the F-Measure, Purity, and Entropy scores vary with the number of annotations and the number of images; in both algorithms the F-Measure and Purity scores are at their maximum, and the Entropy score at its minimum, at the optimum number of annotations. For comparison, K-Means clustering based on HTL and affinity clustering based on HTL are plotted together in Figs. 7-9, from which it is observed that the F-Measure and Purity scores are larger, and the Entropy score smaller, for affinity clustering based on HTL.



Fig 1: Variation of F-Measure scores with annotations (text) in K-Means clustering based on heterogeneous transfer learning

Fig 2: Variation of Purity scores with annotations (text) in K-Means clustering based on heterogeneous transfer learning

Fig 3: Variation of Entropy scores with annotations (text) in K-Means clustering based on heterogeneous transfer learning

Fig 4: Variation of F-Measure scores with annotations (text) in affinity clustering based on heterogeneous transfer learning

Fig 5: Variation of Purity scores with annotations (text) in affinity clustering based on heterogeneous transfer learning

Fig 6: Variation of Entropy scores with annotations (text) in affinity clustering based on heterogeneous transfer learning



Number of     No. of Images   F-Measure,        F-Measure,
Annotations   in Data Set     AP based on HTL   K-Means based on HTL
0             100             0.30711           0.26873
100           100             0.43563           0.35208
0             300             0.25227           0.24254
100           300             0.42308           0.35364
300           300             0.24565           0.10109
0             500             0.18273           0.18823
100           500             0.41944           0.30234
300           500             0.28443           0.19764
500           500             0.19175           0.18912
0             800             0.18928           0.18064
100           800             0.40586           0.32492
300           800             0.35365           0.26969
500           800             0.16184           0.12764

Table 1: Comparison of F-Measure scores

Number of     No. of Images   Purity,           Purity,
Annotations   in Data Set     AP based on HTL   K-Means based on HTL
0             100             0.3700            0.2900
100           100             0.4800            0.3600
0             300             0.2800            0.2000
100           300             0.3907            0.2966
300           300             0.2700            0.2015
0             500             0.1980            0.1680
100           500             0.3362            0.2480
300           500             0.2000            0.1175
500           500             0.1287            0.1060
0             800             0.1900            0.1062
100           800             0.2875            0.2537
300           800             0.2025            0.1200
500           800             0.1912            0.1175

Table 2: Comparison of Purity scores

Number of     No. of Images   Entropy,          Entropy,
Annotations   in Data Set     AP based on HTL   K-Means based on HTL
0             100             0.75162           0.85679
100           100             0.60888           0.70140
0             300             0.80327           0.89764
100           300             0.68969           0.79095
300           300             0.78225           0.80882
0             500             0.80658           0.93903
100           500             0.69742           0.83917
300           500             0.77842           0.88506
500           500             0.79886           0.93907
0             800             0.87716           0.95362
100           800             0.74227           0.78226
300           800             0.78725           0.88942
500           800             0.86091           0.97506

Table 3: Comparison of Entropy scores

Fig 7: Comparison of F-Measure scores with annotations (text) for K-Means clustering based on HTL and AP based on HTL (800 images in the data set)

Fig 8: Comparison of Purity scores with annotations (text) for K-Means clustering based on HTL and AP based on HTL (800 images in the data set)

Fig 9: Comparison of Entropy scores with annotations (text) for K-Means clustering based on HTL and AP based on HTL (800 images in the data set)



5 CONCLUDING REMARKS AND FUTURE DIRECTIONS

In this paper two algorithms for clustering, K-Means clustering based on HTL and affinity clustering based on HTL, have been proposed. The clustering accuracy of K-Means based on HTL is better than that of plain K-Means, while affinity clustering based on HTL gives far better clustering accuracy than simple affinity propagation clustering. It is also concluded that the clustering accuracy of affinity clustering based on HTL is much better than that of K-Means based on HTL. Extensive experiments on many data sets show that the proposed affinity clustering based on HTL produces better clustering accuracy with less computational complexity.

There are a number of interesting potential avenues for future research. Affinity clustering based on HTL can be made hierarchical, the results of FAPML can be improved by redesigning it on the basis of HTL, and both algorithms can be applied to information retrieval.

REFERENCES

[1] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997, pp. 1-414.
[2] E. Alpaydin, Introduction to Machine Learning, Prentice Hall of India, New Delhi, 2006, pp. 133-150.
[3] S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, October 2010, pp. 1345-1359.
[4] X. Ling, G.-R. Xue, W. Dai, Y. Jiang, Q. Yang, and Y. Yu, "Can Chinese Web Pages Be Classified with English Data Source?," Proceedings of the 17th International Conference on World Wide Web, Beijing, China, ACM, April 2008, pp. 969-978.
[5] Q. Yang, Y. Chen, G.-R. Xue, W. Dai, and Y. Yu, "Heterogeneous Transfer Learning for Image Clustering via the Social Web," ACL-IJCNLP 2009, pp. 1-9.
[6] R. Xu and D. C. Wunsch, Clustering, IEEE Press, 2009, pp. 1-282.
[7] A. Jain and R. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[8] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31, no. 3, September 1999, pp. 264-323.
[9] R. Xu and D. Wunsch, "Survey of Clustering Algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, 2005, pp. 645-678.
[10] B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points," Science, vol. 315, 2007, pp. 972-976.
[11] K. Wang, J. Zhang, D. Li, X. Zhang, and T. Guo, "Adaptive Affinity Propagation Clustering," Acta Automatica Sinica, 2007, pp. 1242-1246.
[12] I. E. Givoni and B. J. Frey, "A Binary Variable Model for Affinity Propagation," Neural Computation, vol. 21, no. 6, June 2009, pp. 1589-1600.
[13] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, 1975, pp. 613-620.
[14] http://www.vision.caltech.edu/Image_Datasets/Caltech256/
[15] H. Chim and X. Deng, "Efficient Phrase-Based Document Similarity for Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, 2008.

Shailendra Kumar Shrivastava, B.E. (C.T.), M.E. (CSE), is an Associate Professor in the Department of Information Technology, Samrat Ashok Technological Institute, Vidisha. He has more than 23 years of teaching experience and has published more than 50 research papers in national/international conferences and journals. His areas of interest are machine learning and data mining. He is a Ph.D. scholar at R.G.P.V. Bhopal.

Dr. J. L. Rana, B.E., M.E. (CSE), Ph.D. (CSE), was formerly Head of the Department of Computer Science and Engineering, M.A.N.I.T. Bhopal, M.P., India. He has more than 40 years of teaching experience. His areas of interest include data mining, image processing, and ad-hoc networks. He has many publications in international journals and conferences.

Dr. R. C. Jain, Ph.D., is the Director of Samrat Ashok Technological Institute, Vidisha, M.P., India. He has more than 35 years of teaching experience. His research interests include data mining, computer graphics, and image processing. He has published more than 250 research papers in international journals and conferences.
