CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce
Giannakouris-Salalidis Victor, Undergraduate Student
Plerou Antonia, PhD Candidate
Sioutas Spyros, Associate Professor
Introduction
• Big Data: massive amounts of data resulting from an enormous rate of growth
• Big Data must be handled in various domains: Business Intelligence, Bioinformatics, Social Media Analytics, etc.
• Text Mining: classification/clustering in digital libraries and e-mail, sentiment analysis on social media
• CSMR performs pairwise text similarity: it represents text data in a vector space and measures similarity in a parallel manner using MapReduce
Background
• Vector Space Model: an algebraic model for representing text documents as vectors
• An efficient method for text similarity measurement
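The vector space model above can be sketched in a few lines of plain Python (a hypothetical illustration, not the authors' code): each document becomes a vector of term counts over a shared vocabulary.

```python
def vectorize(docs):
    """Map each document (a list of tokens) to a count vector over the corpus vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    return vocab, [[doc.count(term) for term in vocab] for doc in docs]

docs = [["big", "data", "mining"], ["data", "clustering"]]
vocab, vectors = vectorize(docs)
print(vocab)    # ['big', 'clustering', 'data', 'mining']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 1, 0]]
```

Once documents live in this common vector space, any vector similarity measure (such as the cosine similarity used later) applies directly.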
TF-IDF
• Term Frequency – Inverse Document Frequency
• A numerical statistic that reflects the significance of a term in a corpus of documents
• Commonly used in search engines, text mining, and text similarity in the vector space
TF × IDF = ( n_{i,j} / Σ_{t ∈ d_j} n_{t,j} ) × log( |D| / |{ d ∈ D : t ∈ d }| )

where n_{i,j} is the number of occurrences of term t_i in document d_j, and |D| is the number of documents in the corpus
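A direct Python translation of this formula (a minimal sketch, not part of the original slides) makes the two factors explicit:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF per the formula above: tf = n_{i,j} / |d_j|, idf = log(|D| / df)."""
    tf = doc.count(term) / len(doc)                # term frequency within the document
    df = sum(1 for d in corpus if term in d)       # number of documents containing the term
    idf = math.log(len(corpus) / df)               # inverse document frequency
    return tf * idf

corpus = [["big", "data"], ["data", "mining"], ["text", "mining"]]
# "data" appears in 2 of 3 docs, once in a 2-term doc: tf = 1/2, idf = log(3/2)
print(tf_idf("data", corpus[0], corpus))
```

A term common to every document gets idf = log(1) = 0, which is exactly the "significance" weighting the slide describes.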
Cosine Similarity
• Cosine Similarity: a measure of similarity between two documents represented as vectors
• Measures the angle between the two document vectors
cos(A, B) = (A · B) / ( ||A|| × ||B|| ) = Σ_{i=1}^{n} A_i B_i / ( √(Σ_{i=1}^{n} A_i²) × √(Σ_{i=1}^{n} B_i²) )
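The formula maps one-to-one onto code; here is a minimal Python sketch (an illustration, not from the slides):

```python
import math

def cosine(A, B):
    """cos(A, B) = A·B / (||A|| ||B||) for two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(A, B))
    norm = math.sqrt(sum(a * a for a in A)) * math.sqrt(sum(b * b for b in B))
    return dot / norm

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal, nothing in common)
```

Because TF-IDF vectors are non-negative, document similarities fall in [0, 1]: 1 for identical term distributions, 0 for documents sharing no terms.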
Hadoop
• Framework developed by Apache
• Large-scale data processing and analytics
• Scalable and parallel processing of data on large computer clusters using MapReduce
• Runs on commodity, low-end hardware
• Main components: HDFS (Hadoop Distributed File System), MapReduce
• Currently used by Adobe, Yahoo!, Amazon, eBay, Facebook, and many other companies
MapReduce
• Programming paradigm running on Apache Hadoop
• The main component of Hadoop
• Useful for processing large data sets
• Breaks the data into key-value pairs
• Model derived from the map and reduce functions of functional programming
• Every MapReduce program consists of Mappers and Reducers
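The map → shuffle → reduce flow can be imitated in a single Python process (a toy sketch for intuition only; real Hadoop distributes each stage across a cluster):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process imitation of MapReduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit key-value pairs
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example expressed in this model.
counts = run_mapreduce(
    ["big data", "data mining"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 1, 'data': 2, 'mining': 1}
```

The key property is that mappers and reducers only see their own record or key group, so each stage parallelizes with no shared state.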
MapReduce Diagram
CSMR
• The proposed method, CSMR, combines all the aforementioned techniques
• A scalable algorithm for text clustering using the MapReduce model
• Applies the MapReduce model to TF-IDF and Cosine Similarity
• 4 phases:
1. Word counting
2. Text vectorization using term frequencies
3. Application of TF-IDF to the document vectors
4. Cosine similarity measurement
Phase 1: Word Counting
Algorithm 1: Word Count
1: class Mapper
2:   method Map( document )
3:     for each term ∈ document
4:       write ( ( term , docId ) , 1 )
5:
6: class Reducer
7:   method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] )
8:     sum = 0
9:     for each one ∈ ones do
10:      sum = sum + 1
11:    return ( ( term , docId ) , sum )
12:
13: /* { sum ∈ N : the number of occurrences of term in docId } */
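Algorithm 1 can be simulated in plain Python (a hypothetical sketch, not the authors' Hadoop implementation): the mapper emits ((term, docId), 1) for every term occurrence, and the reducer sums the ones per key.

```python
from collections import defaultdict

def phase1_word_count(docs):
    """docs: {docId: list of terms} -> {(term, docId): number of occurrences}."""
    pairs = defaultdict(int)
    for doc_id, terms in docs.items():
        for term in terms:               # Mapper: write ((term, docId), 1)
            pairs[(term, doc_id)] += 1   # Reducer: sum the ones for each key
    return dict(pairs)

occ = phase1_word_count({"d1": ["big", "data", "data"], "d2": ["data"]})
print(occ)  # {('big', 'd1'): 1, ('data', 'd1'): 2, ('data', 'd2'): 1}
```

Keying on the (term, docId) pair is what lets occurrences of the same term in different documents be counted independently and in parallel.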
Phase 2: Term Frequency
Algorithm 2: Term Frequency
1: class Mapper
2:   method Map( ( term , docId ) , o )
3:     for each element ∈ ( term , docId )
4:       write ( docId , ( term , o ) )
5:
6: class Reducer
7:   method Reduce( docId , ( term , o ) )
8:     N = 0
9:     for each tuple ∈ ( term , o ) do
10:      N = N + o
11:    return ( ( docId , N ) , ( term , o ) )
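A plain-Python sketch of Algorithm 2 (hypothetical, for illustration): regroup the Phase 1 output by document and compute each document's total term count N, the denominator of the term frequency.

```python
from collections import defaultdict

def phase2_term_frequency(occurrences):
    """occurrences: {(term, docId): o} -> {docId: (N, {term: o})}."""
    by_doc = defaultdict(dict)
    for (term, doc_id), o in occurrences.items():   # Mapper: write (docId, (term, o))
        by_doc[doc_id][term] = o
    return {doc_id: (sum(terms.values()), terms)    # Reducer: N = sum of all o
            for doc_id, terms in by_doc.items()}

print(phase2_term_frequency({("big", "d1"): 1, ("data", "d1"): 2}))
# {'d1': (3, {'big': 1, 'data': 2})}
```

Re-keying from (term, docId) to docId is the whole point of this phase: it brings every count for a document to one reducer so N can be computed locally.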
Phase 3: TF-IDF
Algorithm 3: TF-IDF
1: class Mapper
2:   method Map( ( docId , N ) , ( term , o ) )
3:     for each element ∈ ( term , o )
4:       write ( term , ( docId , o , N ) )
5:
6: class Reducer
7:   method Reduce( term , ( docId , o , N ) )
8:     n = 0
9:     for each element ∈ ( docId , o , N ) do
10:      n = n + 1
11:      tf = o / N
12:      idf = log( |D| / (1 + n) )
13:    return ( docId , ( term , tf × idf ) )
14:
15: /* Where |D| is the number of documents in the corpus */
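A plain-Python sketch of Phase 3 (hypothetical, not the authors' code), using the unsmoothed idf = log(|D| / n) from the TF-IDF slide earlier: regroup by term, count the documents containing it, and weight each (docId, term) pair.

```python
import math
from collections import defaultdict

def phase3_tf_idf(doc_totals, num_docs):
    """doc_totals: {docId: (N, {term: o})} -> {docId: {term: tf-idf weight}}."""
    by_term = defaultdict(list)
    for doc_id, (N, terms) in doc_totals.items():
        for term, o in terms.items():        # Mapper: write (term, (docId, o, N))
            by_term[term].append((doc_id, o, N))
    result = defaultdict(dict)
    for term, postings in by_term.items():   # Reducer
        n = len(postings)                    # number of documents containing the term
        for doc_id, o, N in postings:
            result[doc_id][term] = (o / N) * math.log(num_docs / n)
    return dict(result)

weights = phase3_tf_idf({"d1": (2, {"big": 1, "data": 1}), "d2": (1, {"data": 1})}, num_docs=2)
print(weights["d1"]["big"])   # (1/2) * log(2/1)
print(weights["d1"]["data"])  # (1/2) * log(2/2) = 0.0
```

Note that "data", which appears in every document, gets weight 0, while "big", unique to d1, keeps a positive weight.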
Phase 4: Cosine Similarity
Algorithm 4: Cosine Similarity
1: class Mapper
2:   method Map( docs )
3:     n = docs.length
4:
5:     for i = 0 to n
6:       for j = i + 1 to n
7:         write ( ( docs[i].id , docs[j].id ) , ( docs[i].tfidf , docs[j].tfidf ) )
8:
9: class Reducer
10:  method Reduce( ( docId_A , docId_B ) , ( docA.tfidf , docB.tfidf ) )
11:    A = docA.tfidf
12:    B = docB.tfidf
13:    cosine = sum( A × B ) / ( sqrt( sum( A² ) ) × sqrt( sum( B² ) ) )
14:    return ( ( docId_A , docId_B ) , cosine )
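A plain-Python sketch of Phase 4 (hypothetical, for illustration): emit every document pair as the mapper does, then compute the cosine of the two TF-IDF vectors, here stored as sparse term→weight dicts.

```python
import math
from itertools import combinations

def phase4_cosine(tfidf):
    """tfidf: {docId: {term: weight}} -> {(docId_A, docId_B): cosine similarity}."""
    def cos(A, B):
        dot = sum(w * B.get(t, 0.0) for t, w in A.items())
        norm = (math.sqrt(sum(w * w for w in A.values()))
                * math.sqrt(sum(w * w for w in B.values())))
        return dot / norm
    # Mapper: one key per unordered pair (i < j); Reducer: cosine of the two vectors.
    return {(a, b): cos(tfidf[a], tfidf[b])
            for a, b in combinations(sorted(tfidf), 2)}

sims = phase4_cosine({"d1": {"big": 1.0, "data": 1.0}, "d2": {"data": 1.0}})
print(sims)  # {('d1', 'd2'): 0.707...}, i.e. 1/sqrt(2)
```

Enumerating pairs with i < j keeps each pair on exactly one reducer, so the M×(M−1)/2 similarity computations run independently.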
Phase 4: Diagram
Map input: key = ( Doc_i , Doc_j ), value = ( [Doc_i TF-IDF] , [Doc_j TF-IDF] ) for every document pair, e.g. (Doc1, Doc2), (Doc1, Doc3), (Doc1, Doc4), …, (Doc4, Doc10), …, (DocM, DocN)
Reduce output: key = ( Doc_i , Doc_j ), value = Cosine( Doc_i , Doc_j ) for each pair
Conclusions & Future Work
• Finalization of the proposed method
• Implementation of the method
• Experimental tests on real data and computer clusters
• Deployment as an open-source project
• An additional implementation using more efficient tools such as Apache Spark and Scala
• Publication of the test results