CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce
Giannakouris-Salalidis Victor, Undergraduate Student
Plerou Antonia, PhD Candidate
Sioutas Spyros, Associate Professor
Introduction
• Big Data: massive amounts of data resulting from an enormous rate of growth
• Big Data must be handled in various domains: Business Intelligence, Bioinformatics, Social Media Analytics, etc.
• Text Mining: classification/clustering in digital libraries and e-mail, sentiment analysis on social media
• CSMR performs pairwise text similarity: it represents text data in a vector space and measures similarity in a parallel manner using MapReduce
Background
• Vector Space Model: an algebraic model for representing text documents as vectors
• An efficient method for text similarity measurement
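The vector space model above can be sketched in a few lines of plain Python (a hypothetical illustration, not the authors' code): each document becomes a vector of term counts over a shared vocabulary.

```python
def vectorize(docs):
    """Map each document (a list of tokens) to a count vector over the corpus vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    return vocab, [[doc.count(term) for term in vocab] for doc in docs]

docs = [["big", "data", "mining"], ["data", "clustering"]]
vocab, vectors = vectorize(docs)
print(vocab)    # ['big', 'clustering', 'data', 'mining']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 1, 0]]
```

Once documents live in this common vector space, any vector similarity measure (such as the cosine similarity used later) applies directly.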
TF-IDF
• Term Frequency – Inverse Document Frequency
• A numerical statistic that reflects the significance of a term in a corpus of documents
• Commonly used in search engines, text mining, and text similarity in the vector space
TF × IDF = ( n_{i,j} / Σ_{t ∈ d_j} n_{t,j} ) × log( |D| / |{ d ∈ D : t ∈ d }| )

where n_{i,j} is the number of occurrences of term t_i in document d_j, and |D| is the number of documents in the corpus
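A direct Python translation of this formula (a minimal sketch, not part of the original slides) makes the two factors explicit:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF per the formula above: tf = n_{i,j} / |d_j|, idf = log(|D| / df)."""
    tf = doc.count(term) / len(doc)                # term frequency within the document
    df = sum(1 for d in corpus if term in d)       # number of documents containing the term
    idf = math.log(len(corpus) / df)               # inverse document frequency
    return tf * idf

corpus = [["big", "data"], ["data", "mining"], ["text", "mining"]]
# "data" appears in 2 of 3 docs, once in a 2-term doc: tf = 1/2, idf = log(3/2)
print(tf_idf("data", corpus[0], corpus))
```

A term common to every document gets idf = log(1) = 0, which is exactly the "significance" weighting the slide describes.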
Cosine Similarity
• Cosine Similarity: a measure of similarity between two documents represented as vectors
• Measures the angle between the two document vectors
cos(A, B) = (A · B) / ( ||A|| × ||B|| ) = Σ_{i=1}^{n} A_i B_i / ( √(Σ_{i=1}^{n} A_i²) × √(Σ_{i=1}^{n} B_i²) )
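The formula maps one-to-one onto code; here is a minimal Python sketch (an illustration, not from the slides):

```python
import math

def cosine(A, B):
    """cos(A, B) = A·B / (||A|| ||B||) for two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(A, B))
    norm = math.sqrt(sum(a * a for a in A)) * math.sqrt(sum(b * b for b in B))
    return dot / norm

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal, nothing in common)
```

Because TF-IDF vectors are non-negative, document similarities fall in [0, 1]: 1 for identical term distributions, 0 for documents sharing no terms.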
Hadoop
• Framework developed by Apache
• Large-scale data processing and analytics
• Scalable and parallel processing of data on large computer clusters using MapReduce
• Runs on commodity, low-end hardware
• Main components: HDFS (Hadoop Distributed File System), MapReduce
• Currently used by Adobe, Yahoo!, Amazon, eBay, Facebook, and many other companies
MapReduce
• Programming paradigm running on Apache Hadoop
• The main component of Hadoop
• Useful for processing large data sets
• Breaks the data into key-value pairs
• Model derived from the map and reduce functions of functional programming
• Every MapReduce program consists of Mappers and Reducers
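The map → shuffle → reduce flow can be imitated in a single Python process (a toy sketch for intuition only; real Hadoop distributes each stage across a cluster):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process imitation of MapReduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit key-value pairs
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example expressed in this model.
counts = run_mapreduce(
    ["big data", "data mining"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 1, 'data': 2, 'mining': 1}
```

The key property is that mappers and reducers only see their own record or key group, so each stage parallelizes with no shared state.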
MapReduce Diagram
CSMR
• The proposed method, CSMR, combines all the aforementioned techniques
• A scalable algorithm for text clustering using the MapReduce model
• Applies the MapReduce model to TF-IDF and Cosine Similarity
• 4 phases:
1. Word counting
2. Text vectorization using term frequencies
3. Application of TF-IDF to the document vectors
4. Cosine similarity measurement
Phase 1: Word Counting
Algorithm 1: Word Count
1: class Mapper
2:   method Map( document )
3:     for each term ∈ document
4:       write ( ( term , docId ) , 1 )
5:
6: class Reducer
7:   method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] )
8:     sum = 0
9:     for each one ∈ ones do
10:      sum = sum + 1
11:    return ( ( term , docId ) , sum )
12:
13: /* { sum ∈ N : the number of occurrences of term in docId } */
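Algorithm 1 can be simulated in plain Python (a hypothetical sketch, not the authors' Hadoop implementation): the mapper emits ((term, docId), 1) for every term occurrence, and the reducer sums the ones per key.

```python
from collections import defaultdict

def phase1_word_count(docs):
    """docs: {docId: list of terms} -> {(term, docId): number of occurrences}."""
    pairs = defaultdict(int)
    for doc_id, terms in docs.items():
        for term in terms:               # Mapper: write ((term, docId), 1)
            pairs[(term, doc_id)] += 1   # Reducer: sum the ones for each key
    return dict(pairs)

occ = phase1_word_count({"d1": ["big", "data", "data"], "d2": ["data"]})
print(occ)  # {('big', 'd1'): 1, ('data', 'd1'): 2, ('data', 'd2'): 1}
```

Keying on the (term, docId) pair is what lets occurrences of the same term in different documents be counted independently and in parallel.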
Phase 2: Term Frequency
Algorithm 2: Term Frequency
1: class Mapper
2:   method Map( ( term , docId ) , o )
3:     for each element ∈ ( term , docId )
4:       write ( docId , ( term , o ) )
5:
6: class Reducer
7:   method Reduce( docId , ( term , o ) )
8:     N = 0
9:     for each tuple ∈ ( term , o ) do
10:      N = N + o
11:    return ( ( docId , N ) , ( term , o ) )
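A plain-Python sketch of Algorithm 2 (hypothetical, for illustration): regroup the Phase 1 output by document and compute each document's total term count N, the denominator of the term frequency.

```python
from collections import defaultdict

def phase2_term_frequency(occurrences):
    """occurrences: {(term, docId): o} -> {docId: (N, {term: o})}."""
    by_doc = defaultdict(dict)
    for (term, doc_id), o in occurrences.items():   # Mapper: write (docId, (term, o))
        by_doc[doc_id][term] = o
    return {doc_id: (sum(terms.values()), terms)    # Reducer: N = sum of all o
            for doc_id, terms in by_doc.items()}

print(phase2_term_frequency({("big", "d1"): 1, ("data", "d1"): 2}))
# {'d1': (3, {'big': 1, 'data': 2})}
```

Re-keying from (term, docId) to docId is the whole point of this phase: it brings every count for a document to one reducer so N can be computed locally.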
Phase 3: TF-IDF
Algorithm 3: TF-IDF
1: class Mapper
2:   method Map( ( docId , N ) , ( term , o ) )
3:     for each element ∈ ( term , o )
4:       write ( term , ( docId , o , N ) )
5:
6: class Reducer
7:   method Reduce( term , ( docId , o , N ) )
8:     n = 0
9:     for each element ∈ ( docId , o , N ) do
10:      n = n + 1
11:      tf = o / N
12:      idf = log( |D| / (1 + n) )
13:    return ( docId , ( term , tf × idf ) )
14:
15: /* Where |D| is the number of documents in the corpus */
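A plain-Python sketch of Phase 3 (hypothetical, not the authors' code), using the unsmoothed idf = log(|D| / n) from the TF-IDF slide earlier: regroup by term, count the documents containing it, and weight each (docId, term) pair.

```python
import math
from collections import defaultdict

def phase3_tf_idf(doc_totals, num_docs):
    """doc_totals: {docId: (N, {term: o})} -> {docId: {term: tf-idf weight}}."""
    by_term = defaultdict(list)
    for doc_id, (N, terms) in doc_totals.items():
        for term, o in terms.items():        # Mapper: write (term, (docId, o, N))
            by_term[term].append((doc_id, o, N))
    result = defaultdict(dict)
    for term, postings in by_term.items():   # Reducer
        n = len(postings)                    # number of documents containing the term
        for doc_id, o, N in postings:
            result[doc_id][term] = (o / N) * math.log(num_docs / n)
    return dict(result)

weights = phase3_tf_idf({"d1": (2, {"big": 1, "data": 1}), "d2": (1, {"data": 1})}, num_docs=2)
print(weights["d1"]["big"])   # (1/2) * log(2/1)
print(weights["d1"]["data"])  # (1/2) * log(2/2) = 0.0
```

Note that "data", which appears in every document, gets weight 0, while "big", unique to d1, keeps a positive weight.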
Phase 4: Cosine Similarity
Algorithm 4: Cosine Similarity
1: class Mapper
2:   method Map( docs )
3:     n = docs.length
4:
5:     for i = 0 to n
6:       for j = i + 1 to n
7:         write ( ( docs[i].id , docs[j].id ) , ( docs[i].tfidf , docs[j].tfidf ) )
8:
9: class Reducer
10:  method Reduce( ( docId_A , docId_B ) , ( docA.tfidf , docB.tfidf ) )
11:    A = docA.tfidf
12:    B = docB.tfidf
13:    cosine = sum( A × B ) / ( sqrt( sum( A² ) ) × sqrt( sum( B² ) ) )
14:    return ( ( docId_A , docId_B ) , cosine )
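A plain-Python sketch of Phase 4 (hypothetical, for illustration): emit every document pair as the mapper does, then compute the cosine of the two TF-IDF vectors, here stored as sparse term→weight dicts.

```python
import math
from itertools import combinations

def phase4_cosine(tfidf):
    """tfidf: {docId: {term: weight}} -> {(docId_A, docId_B): cosine similarity}."""
    def cos(A, B):
        dot = sum(w * B.get(t, 0.0) for t, w in A.items())
        norm = (math.sqrt(sum(w * w for w in A.values()))
                * math.sqrt(sum(w * w for w in B.values())))
        return dot / norm
    # Mapper: one key per unordered pair (i < j); Reducer: cosine of the two vectors.
    return {(a, b): cos(tfidf[a], tfidf[b])
            for a, b in combinations(sorted(tfidf), 2)}

sims = phase4_cosine({"d1": {"big": 1.0, "data": 1.0}, "d2": {"data": 1.0}})
print(sims)  # {('d1', 'd2'): 0.707...}, i.e. 1/sqrt(2)
```

Enumerating pairs with i < j keeps each pair on exactly one reducer, so the M×(M−1)/2 similarity computations run independently.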
Phase 4: Diagram
Map input: key = ( Doc_i , Doc_j ), value = ( [Doc_i TF-IDF] , [Doc_j TF-IDF] ) for every document pair, e.g. (Doc1, Doc2), (Doc1, Doc3), (Doc1, Doc4), …, (Doc4, Doc10), …, (DocM, DocN)
Reduce output: key = ( Doc_i , Doc_j ), value = Cosine( Doc_i , Doc_j ) for each pair
Conclusions & Future Work
• Finalization of the proposed method
• Implementation of the method
• Experimental tests on real data and computer clusters
• Deployment as an open-source project
• An additional implementation using more efficient tools such as Apache Spark and Scala
• Publication of the test results