

Document Clustering Based on Term Frequency and Inverse Document Frequency

S. Suseela, Assistant Professor/CSE, Periyar Maniammai University, Vallam. [email protected]

Abstract- Organizations such as offices, colleges, and industries maintain large amounts of information: medical documents, financial documents, and many other document collections. These collections contain billions of items, and they have been growing exponentially. One way to deal with this extraordinary flood of data is cluster analysis. This has led to the need to organize large sets of documents into categories through clustering. Grouping similar documents together into clusters helps users find relevant information more quickly and allows them to focus their search in the appropriate direction. Clustering is used to divide large unstructured document corpora into groups of more or less closely related documents. We propose a new similarity measure that computes the similarity of text-based documents from term frequency and inverse document frequency in the vector space model. We apply the new similarity measure to the hierarchical agglomerative clustering (HAC) algorithm and develop a new clustering approach (DC_TFIDF). These models provide accurate document similarity calculation and improve the effectiveness of the clustering technique over traditional methods.

Keywords: Cluster Analysis, Similarity Measure, Vector Space Model, Term Frequency and Inverse Document Frequency

I.INTRODUCTION

Since most information content is still available in textual form, text is an important basis for information retrieval. Natural language text carries a lot of meaning, which still cannot be fully captured computationally. Therefore, information retrieval systems are based on strongly simplified models of text that ignore most of the grammatical structure and reduce texts essentially to the terms they contain [1]. This approach, called full-text retrieval, is a simplification that has proven very successful. It is now gradually being extended to take into account other features of documents, such as document or link structure.

Text mining shares many concepts with traditional data mining methods. Data mining includes many techniques that can unveil the inherent structure in underlying data; one of these techniques is clustering. When applied to textual data, clustering methods try to identify inherent groupings of the text documents, producing a set of clusters that exhibit high intracluster similarity and low intercluster similarity [2]. Generally, text document clustering methods attempt to segregate documents into groups, where each group represents a topic that is different from the topics represented by the other groups [3]. Applications of document clustering include clustering retrieved documents to present organized and understandable results to the user, clustering documents in a collection (e.g., digital libraries), automated (or semi-automated) creation of document taxonomies (e.g., Yahoo and Open Directory styles), and efficient information retrieval that focuses on relevant subsets (clusters) rather than whole collections.

In general, clustering techniques are based on four concepts: a data representation model, a similarity measure, a clustering model, and a clustering algorithm. Most current document clustering methods are based on the Vector Space Document (VSD) model [4]. The common framework of this data model starts with a representation of any document as a feature vector of the words that appear in the documents of a data set. A distinct word appearing in the documents is usually considered an atomic feature term in the VSD model, because words are the basic units in most natural languages (including English) for representing semantic concepts. In particular, the term weights (usually tf-idf: term frequencies and inverse document frequencies) of the words are contained in each feature vector [6]. The similarity between two documents is computed with one of several similarity measures based on the two corresponding feature vectors, e.g., the cosine measure, the Jaccard measure, or the Euclidean distance. To achieve more accurate document clustering, a more informative feature term, the phrase, has been considered in recent research work and literature. A phrase of a document is an ordered sequence of one or more words [7]. Bigrams and trigrams are commonly used to extract and identify meaningful phrases in statistical natural language processing [8].
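To make these measures concrete, the following minimal sketch (not from the paper; the two tf-idf vectors and function names are invented for illustration) computes the cosine measure, the extended Jaccard measure for weighted vectors, and the Euclidean distance:

```python
import math

def cosine(x, y):
    """Cosine similarity: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def jaccard(x, y):
    """Extended (Tanimoto) Jaccard similarity for weighted vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

def euclidean(x, y):
    """Euclidean distance (a dissimilarity, unlike the two measures above)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two toy tf-idf vectors over a 4-term vocabulary.
d1 = [0.69, 0.69, 1.39, 0.0]
d2 = [0.69, 0.69, 0.0, 1.39]
print(cosine(d1, d2), jaccard(d1, d2), euclidean(d1, d2))
```

Note that cosine and Jaccard grow with similarity (1 for identical vectors), while Euclidean distance shrinks (0 for identical vectors).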

II.DOCUMENT CLUSTERING

Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters. Unlike document classification, no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Text clustering has been applied with decision trees [9], statistical analysis [10], neural networks [11], inductive logic programming, and rule-based systems, and it draws on research areas such as databases, information retrieval (IR), artificial intelligence (AI), and natural language processing.

Any clustering technique relies on four concepts:

1. A data representation model,
2. A similarity measure,
3. A cluster model, and
4. A clustering algorithm that builds the clusters using the data model and the similarity measure.

Similarity between documents is measured using one of several similarity measures that are based on such feature vectors. Fig. 2.1 shows how the clustering of documents is achieved.


The clustering of documents can be visualized as a dendrogram (Fig. 2.1), a tree-like diagram that records the sequence of merges or splits. A clustering of the data objects is obtained by cutting the dendrogram at the desired level.

[Fig. 2.1. Dendrogram illustrating agglomerative (merging) and divisive (splitting) hierarchical clustering.]
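For illustration, the sketch below (assuming SciPy and invented toy document vectors, not anything from the paper) builds the dendrogram with average-link agglomerative clustering and cuts it at a chosen distance threshold to obtain flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy tf-idf document vectors (5 documents, 4 terms).
docs = np.array([
    [1.0, 0.7, 0.0, 0.0],
    [0.9, 0.8, 0.0, 0.1],
    [0.0, 0.1, 1.2, 0.9],
    [0.0, 0.0, 1.1, 1.0],
    [0.5, 0.4, 0.5, 0.4],
])

# Build the merge tree (dendrogram) using cosine distance and average linkage.
Z = linkage(docs, method="average", metric="cosine")

# "Cut" the dendrogram at distance 0.5: every subtree below the cut is a cluster.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # cluster id per document
```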

III.VECTOR SPACE MODEL

The vector space model (or term vector model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking. A document is represented as a vector, and each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed; one of the best known schemes is tf-idf weighting. The definition of a term depends on the application. Typically, terms are single words, keywords, or longer phrases. If words are chosen as the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).

A similarity measure, as a computable numeric value, is defined on the query vector and the document vector to express the likeness (closeness) between a query and a document. The similarity measure typically has the following three basic properties: (i) it usually takes on values between 0 and 1; (ii) its value does not depend on the order in which the query and the document are compared when computing the similarity; (iii) it is equal to 1 when the query and document vectors are equal. Those documents for which the similarity measure exceeds a threshold value are said to be retrieved in response to the query. The main advantages of tf-idf are that term weighting improves the quality of the answer set, partial matching allows retrieval of documents that approximate the query conditions, and the cosine ranking formula sorts documents by their closeness to the query.

The vector space model procedure can be divided into three stages. The first stage is document indexing, where content-bearing terms are extracted from the document text. The second stage is the weighting of the indexed terms to enhance retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure.
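The three stages can be sketched end to end as follows (a minimal illustration; the toy corpus, stop list, and helper names are invented, and the weighting follows the tf-idf scheme of Section V):

```python
import math
import re
from collections import Counter

docs = ["the cat ate cheese", "the cat ate milk", "the dog ate bones"]
stop = {"the", "a", "an"}

# Stage 1: indexing -- extract content-bearing terms.
tokenized = [[w for w in re.findall(r"[a-z]+", d.lower()) if w not in stop]
             for d in docs]
vocab = sorted({w for t in tokenized for w in t})

# Stage 2: weighting -- tf-idf weight for each term of each document.
N = len(docs)
df = Counter(w for t in tokenized for w in set(t))

def weights(tokens):
    tf = Counter(tokens)
    return [(1 + math.log(tf[w])) * math.log(1 + N / df[w]) if tf[w] else 0.0
            for w in vocab]

vecs = [weights(t) for t in tokenized]

# Stage 3: ranking -- sort documents by cosine similarity to the query.
def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

q = weights([w for w in "cat cheese".split() if w not in stop])
print(sorted(range(N), key=lambda i: cos(q, vecs[i]), reverse=True))
```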

IV.HIERARCHICAL AGGLOMERATIVE CLUSTERING

After computing the cosine similarity, we apply the similarity values in the HAC algorithm to find the clusters. The following procedure is used to automatically group related documents into clusters. The first step is to create an N x N doc-doc similarity matrix. Second, each document starts as a cluster of size one. Then the two clusters with the greatest similarity are combined, and the process continues until only one cluster remains. Fig. 4.1 shows how the clustering is achieved through the simple HAC algorithm.

SimpleHAC(d1, ..., dN)
Step 1:  for n ← 1 to N
Step 2:      do for i ← 1 to N
Step 3:          do C[n][i] ← Sim(dn, di)
Step 4:      I[n] ← 1                        (I[n] = 1: cluster n is still available to be merged)
Step 5:  A ← [ ]                             (A: list of merges)
Step 6:  for k ← 1 to N-1
Step 7:      do <i, m> ← argmax {<i, m> : i ≠ m and I[i] = 1 and I[m] = 1} C[i][m]
Step 8:      Append <i, m> to A
Step 9:      for j ← 1 to N
Step 10:         do C[i][j] ← Sim(j, i, m)   (C[i][j]: similarity between clusters i and j)
Step 11:         C[j][i] ← Sim(j, i, m)
Step 12:     I[m] ← 0                        (cluster m has been merged into cluster i)
Step 13: return A                            (A: list of clusters)

Fig. 4.1. The simple HAC algorithm.
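A runnable Python rendering of SimpleHAC may make the bookkeeping clearer. This is a sketch under an assumed complete-link update for Sim(j, i, m), since the figure leaves the linkage unspecified; the similarity matrix is a toy example:

```python
import numpy as np

def simple_hac(C):
    """Agglomerative clustering from an N x N similarity matrix C.

    Returns the list of merges <i, m> in the order they were made.
    Complete-link update assumed: sim(j, i U m) = min(sim(j, i), sim(j, m)).
    """
    C = C.astype(float).copy()
    N = C.shape[0]
    active = [True] * N          # I[n] in the pseudocode
    merges = []                  # A in the pseudocode
    for _ in range(N - 1):
        # Step 7: find the most similar pair of distinct active clusters.
        best, pair = -np.inf, None
        for i in range(N):
            for m in range(N):
                if i != m and active[i] and active[m] and C[i, m] > best:
                    best, pair = C[i, m], (i, m)
        i, m = pair
        merges.append(pair)
        # Steps 9-11: update similarities of every cluster j to the merged cluster i.
        for j in range(N):
            C[i, j] = C[j, i] = min(C[j, i], C[j, m])
        active[m] = False        # Step 12: cluster m is absorbed into cluster i
    return merges

sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
print(simple_hac(sim))  # [(0, 1), (0, 2)]
```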

V. DOCUMENT SIMILARITY BASED ON TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY (CSDC_TFIDF)

In this model, each document d is considered to be a vector in the M-dimensional term space. In particular, we employ the tf-idf term-weighting scheme, in which each document is represented as

d = {w(1, d), w(2, d), ..., w(M, d)},    (5.1)

w(i, d) = (1 + log tf(i, d)) · log(1 + N/df(i)),    (5.2)

where tf(i, d) is the frequency of the ith term in document d, and df(i) is the number of documents containing the ith term.
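For concreteness, a direct transcription of equation (5.2) into Python (a minimal sketch; a natural logarithm is assumed, since the paper does not state the base):

```python
import math

def weight(tf, df, N):
    """tf-idf weight per equation (5.2): (1 + log tf) * log(1 + N / df)."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(1 + N / df)

# A term occurring once in a document and in 1 of 3 documents:
print(weight(tf=1, df=1, N=3))  # (1 + log 1) * log(1 + 3/1) = log 4 ~ 1.386294
```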

In the VSD model, the cosine similarity is the most commonly used measure for computing the pairwise similarity of two documents di and dj. For two weighted document vectors

D1 = (w11, w12, ..., w1t),    (5.3)
D2 = (w21, w22, ..., w2t),    (5.4)

it is defined as cos_sim(D1, D2) = (D1 · D2) / (|D1| · |D2|), the dot product of the two vectors divided by the product of their lengths.




[Fig. 5.1. DC_TFIDF system flow: document sets → removing the stop words → vector space model → cosine similarity calculation → clustering (CSDC_TFIDF) → clustered documents.]

Before document clustering, a document "cleaning" procedure is executed for all documents in the document sets. First, all non-word tokens are stripped off. Second, the text is parsed into words. Third, all stop words are identified and removed. Finally, all stemmed words are concatenated into new documents. Using the vector space model, the words are represented as feature vectors. Equations 5.1 and 5.2 are then applied to find the weight of each term in each document. After obtaining the term weights, we apply the cosine similarity to find the similar documents. If the cosine similarity equals one, the two documents are similar; otherwise they are not.
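A minimal sketch of this cleaning procedure follows (the stop list is abbreviated, and the suffix-stripping is a crude stand-in for a real stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "an", "too", "also", "and", "or", "of", "in", "to"}

def stem(word):
    # Crude stand-in for a real stemmer (e.g., the Porter stemmer).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def clean(text):
    """Strip non-word tokens, parse into words, drop stop words, stem."""
    words = re.findall(r"[a-z]+", text.lower())        # steps 1-2
    words = [w for w in words if w not in STOP_WORDS]  # step 3
    return " ".join(stem(w) for w in words)            # step 4

print(clean("The cat ate cheese too."))  # -> "cat ate chees" (crude stemming)
```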

VI.PROCEDURE CSDC_TFIDF()

Procedure CSDC_TFIDF()
Input: document sets
Output: similar documents
1: N ← number of documents
2: Idf ← inverse document frequency
3: Df ← document frequency
4: di ← document identifiers
5: for each document D do
6:     for each di do
7:         read the document from left to right over the document D
8:         calculate the term frequency for each term di ∈ D
9:         calculate the inverse document frequency for each term di ∈ D
10:        calculate the cosine similarity: cos_sim = (dx · dy) / (|dx| · |dy|)
11:        if (cos_sim == 1) then
               the two documents are similar
           else
               the two documents are not similar
           end if
   end for
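The procedure can be realized in Python roughly as follows (a sketch; the documents are assumed to be already cleaned as described above, the weighting follows equation (5.2), and math.isclose stands in for the exact cos_sim == 1 test to tolerate floating-point error):

```python
import math
from collections import Counter

def csdc_tfidf(docs):
    """Return pairs of documents whose cosine similarity is (near) 1."""
    tokenized = [d.split() for d in docs]
    N = len(docs)
    df = Counter(w for t in tokenized for w in set(t))
    vocab = sorted(df)

    def vec(tokens):  # tf-idf vector per equation (5.2)
        tf = Counter(tokens)
        return [(1 + math.log(tf[w])) * math.log(1 + N / df[w]) if tf[w] else 0.0
                for w in vocab]

    vecs = [vec(t) for t in tokenized]

    def cos(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (math.sqrt(sum(a * a for a in x)) *
                      math.sqrt(sum(b * b for b in y)))

    similar = []
    for i in range(N):
        for j in range(i + 1, N):
            if math.isclose(cos(vecs[i], vecs[j]), 1.0):  # cos_sim == 1
                similar.append((i, j))
    return similar

# Identical cleaned documents have cosine similarity 1:
print(csdc_tfidf(["cat ate cheese", "cat ate cheese", "cat ate milk"]))
# -> [(0, 1)]
```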

VII.EXPERIMENTAL RESULT

Step 1: Take two documents D1 and D2 over the document sets:

D1:                             D2:
X1. The cat ate cheese too.     Y1. The cat ate cheese too.
X2. The cat ate milk also.      Y2. The cat ate milk also.
X3. The cat ate mouse too.      Y3. The cat ate mouse too.

Table 7.1. The document sets D1 and D2.

Step 2: After removing the stop words:

D1:                   D2:
X1. cat ate cheese    Y1. cat ate cheese
X2. cat ate milk      Y2. cat ate milk
X3. cat ate mouse     Y3. cat ate mouse

Table 7.2. The document sets after stop-word removal.

Step 3: Draw the term-document matrices for D1 and D2, where Xi1, Xi2, ..., Xin and Yi1, Yi2, ..., Yin identify the terms of sentences Xi and Yi.

Document D1:

Term   Word     Term Frequency   Document Frequency
X11    cat      1                3
X12    ate      1                3
X13    cheese   1                1
X21    cat      1                3
X22    ate      1                3
X23    milk     1                1
X31    cat      1                3
X32    ate      1                3
X33    mouse    1                1

Table 7.3. Term-document matrix for D1.

Document D2:

Term   Word     Term Frequency   Document Frequency
Y11    cat      1                3
Y12    ate      1                3
Y13    cheese   1                1
Y21    cat      1                3
Y22    ate      1                3
Y23    milk     1                1
Y31    cat      1                3
Y32    ate      1                3
Y33    mouse    1                1

Table 7.4. Term-document matrix for D2.

Step 4: Find the weight for each term over D1 using w(i, d) = (1 + log tf(i, d)) · log(1 + N/df(i)):

W(x11, D1) = (1 + log tf(x11, D1)) · log(1 + N/df(x11)) = (1 + log 1) · log(1 + 3/3) = log 2 ≈ 0.693147
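The step-4 arithmetic can be checked directly (natural logarithm assumed, as in the sketch of equation (5.2) above):

```python
import math

# w(x11, D1) = (1 + log tf) * log(1 + N/df) with tf = 1, N = 3, df = 3
w = (1 + math.log(1)) * math.log(1 + 3 / 3)
print(w)  # 0.6931471805599453, i.e. log 2
```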




The weights for each term over D1:

Term   Weight
x11    0.693147
x12    0.693147
x13    1.386294
x21    0.693147
x22    0.693147
x23    1.386294
x31    0.693147
x32    0.693147
x33    1.386294

Table 7.5. Term weights for D1.

The weights for each term over D2:

Term   Weight
y11    0.693147
y12    0.693147
y13    1.386294
y21    0.693147
y22    0.693147
y23    1.386294
y31    0.693147
y32    0.693147
y33    1.386294

Table 7.6. Term weights for D2.

Step 5: Apply the cosine similarity:

cos_sim = (x11·y11 + x12·y12 + ... + x33·y33) / (√(x11² + x12² + ... + x33²) · √(y11² + y12² + ... + y33²))

cos_sim = 1.

Thus, the two documents D1 and D2 are similar. Likewise, the values are found for all terms in the term-document matrix. The hierarchical agglomerative algorithm is then applied to find the clustered documents.

VIII.EVALUATION METRICS

A.Entropy

One external measure is entropy, which provides a measure of "goodness" for unnested clusters or for the clusters at one level of a hierarchical clustering. Entropy tells us how homogeneous a cluster is: the higher the homogeneity of a cluster, the lower its entropy, and vice versa. The entropy of a cluster containing only one object (perfect homogeneity) is zero. Let P be a partition result of a clustering algorithm consisting of m clusters. For every cluster j in P we compute p_ij, the probability that a member of cluster j belongs to class i. The entropy of each cluster j is calculated with the standard formula

E_j = - Σ_i p_ij · log(p_ij),

where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster, weighted by the size of each cluster:

E = Σ_j (N_j / N) · E_j,

where N_j is the size of cluster j and N is the total number of data objects. We would like to generate clusters of lower entropy, which is an indication of the homogeneity (or similarity) of objects in the clusters. The weighted overall entropy formula avoids favoring smaller clusters over larger clusters.

B. F_Measure

The second external quality measure is the F-measure, which combines the precision and recall ideas from the information retrieval literature. The precision and recall of a cluster j with respect to a class i are defined as:

P = Precision(i, j) = N_ij / N_j,    R = Recall(i, j) = N_ij / N_i,

where N_ij is the number of members of class i in cluster j, N_j is the number of members of cluster j, and N_i is the number of members of class i. The F-measure of a class i is defined as:

F(i) = 2PR / (P + R)
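Both metrics can be computed from the class counts per cluster. A minimal sketch (the cluster and class labels are invented; for each class, the best F over all clusters is taken, as is common practice, though the paper does not state this explicitly):

```python
import math
from collections import Counter

def entropy_and_f(clusters, classes):
    """clusters, classes: parallel lists of cluster ids and true class ids."""
    N = len(clusters)
    cluster_sizes = Counter(clusters)        # N_j
    class_sizes = Counter(classes)           # N_i
    joint = Counter(zip(classes, clusters))  # N_ij

    # Weighted total entropy: E = sum_j (N_j / N) * E_j
    total_entropy = 0.0
    for j, Nj in cluster_sizes.items():
        Ej = 0.0
        for i in class_sizes:
            p = joint[(i, j)] / Nj           # p_ij
            if p > 0:
                Ej -= p * math.log(p)
        total_entropy += (Nj / N) * Ej

    # F-measure per class: best 2PR / (P + R) over all clusters.
    f_scores = {}
    for i, Ni in class_sizes.items():
        best = 0.0
        for j, Nj in cluster_sizes.items():
            Nij = joint[(i, j)]
            if Nij:
                P, R = Nij / Nj, Nij / Ni
                best = max(best, 2 * P * R / (P + R))
        f_scores[i] = best
    return total_entropy, f_scores

print(entropy_and_f([1, 1, 2, 2], ["a", "a", "a", "b"]))
# -> (0.3466..., {'a': 0.8, 'b': 0.666...})
```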

IX.RELATED WORK

In traditional document models such as the VSD model, words or characters are considered the basic terms in statistical feature analysis and extraction. To achieve more accurate document clustering, incorporating more informative features has recently become increasingly important in the information retrieval literature. Bigrams, trigrams, and much longer n-grams have been commonly used in statistical natural language processing. Closely related recent work compares four similarity measures on a collection of documents: the Dice coefficient, the Jaccard measure, cosine similarity, and Euclidean distance.

X.CONCLUSIONS

The traditional VSD model plays an important role in text-based information retrieval. We have proposed vector-based document models and cosine similarity measures as a highly accurate and efficient practical document clustering solution. The CSDC_TFIDF system improves efficiency and effectiveness in an information retrieval system, and it also reduces the search space. The proposed system clusters the documents and gives users an overview of the contents of a document collection. If a collection is well clustered, we can search only the cluster that will contain relevant documents. Searching a smaller collection should improve both effectiveness and efficiency.

XI.REFERENCES

[1] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (ACM Press Series). Addison Wesley, 1999.


[2] K. Cios, W. Pedrycz, and R. Swiniarski, Data Mining Methods for Knowledge Discovery. Boston: Kluwer Academic Publishers, 1998.
[3] W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, N.J.: Prentice Hall, 1992.
[4] M. Porter, "New Models in Probabilistic Information Retrieval," British Library Research and Development Report, no. 5587, 1980.
[5] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[6] E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.
[7] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[8] M. Yamamoto and K. W. Church, "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus," Computational Linguistics, vol. 27, no. 1, pp. 1-30, 2001.
[9] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive Learning Algorithms and Representations for Text Categorization," Proc. Seventh Int'l Conf. Information and Knowledge Management, pp. 148-155, Nov. 1998.
[10] D. Freitag and A. McCallum, "Information Extraction with HMMs and Shrinkage," Proc. AAAI-99 Workshop on Machine Learning for Information Extraction, pp. 31-36, 1999.
[11] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: Self-Organizing Maps of Document Collections," Proc. WSOM'97, Workshop on Self-Organizing Maps, pp. 310-315, June 1997.
[12] O. Zamir and O. Etzioni, "Grouper: A Dynamic Clustering Interface to Web Search Results," Computer Networks, vol. 31, nos. 11-16, pp. 1361-1374, 1999.
[13] E. Charniak, Statistical Language Learning. MIT Press, 1993.