Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Douglas Oard
Niveda Krishnamoorthy
Overview
• Pairwise Similarity
• MapReduce Framework
• Proposed algorithm
  • Inverted index construction
  • Pairwise document similarity calculation
• Results
Pairwise Similarity of Documents
• PubMed – “More like this”
• Similar blog posts
• Google – Similar pages
MapReduce
• Programming framework that supports distributed computing on clusters of computers
• Introduced by Google in 2004
• Map step
• Reduce step
• Combine step (optional)
• Applications
MapReduce Model
Example – Word Frequency
Consider two files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop

Expected output:
Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1
Map Phase
Map 1 (Hello World Bye World):
<Hello,1> <World,1> <Bye,1> <World,1>
Map 2 (Hello Hadoop Goodbye Hadoop):
<Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>
Reduce Phase
Map output:
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

After shuffle & sort:
<Hello,(1,1)>
<World,(1,1)>
<Bye,(1)>
<Hadoop,(1,1)>
<Goodbye,(1)>

Reduce 1–5 output:
Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1
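The word-frequency walkthrough above can be sketched as a minimal in-memory map/shuffle/reduce. This is illustrative only: the function names are invented here, and a real Hadoop job would implement Mapper and Reducer classes, with the framework performing the shuffle.

```python
from collections import defaultdict

def map_word_count(document):
    # Map step: emit <word, 1> for every token in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(mapped):
    # Group values by key, as the framework's shuffle & sort step does.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_word_count(word, counts):
    # Reduce step: sum the partial counts for one word.
    return (word, sum(counts))

docs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
pairs = [kv for d in docs for kv in map_word_count(d)]
result = dict(reduce_word_count(w, c) for w, c in shuffle(pairs).items())
# result == {"Hello": 2, "World": 2, "Bye": 1, "Hadoop": 2, "Goodbye": 1}
```

An optional combiner would apply the same summing locally on each mapper's output before the shuffle, reducing network traffic.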
Pairwise Document Similarity
MapReduce algorithm – scalable and efficient:
• Inverted index computation
• Pairwise similarity
Constructing Inverted Index (Map Phase)
Document 1 (A A B C) → Map 1:
<A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
Document 2 (B D D) → Map 2:
<B,(d2,1)> <D,(d2,2)>
Document 3 (A B B E) → Map 3:
<A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
Constructing Inverted Index (Reduce Phase)
Map output:
<A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
<B,(d2,1)> <D,(d2,2)>
<A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

After shuffle & sort, Reduce 1–5 emit the postings lists:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>
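The inverted-index construction above can be sketched the same way: each mapper emits <term, (doc_id, tf)> postings, and each reducer collects all postings for one term into a list. The shuffle is simulated with a dictionary; function names are illustrative.

```python
from collections import defaultdict

def map_postings(doc_id, text):
    # Count term frequencies in one document, then emit
    # <term, (doc_id, tf)> for each distinct term.
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    for term, freq in tf.items():
        yield (term, (doc_id, freq))

def reduce_postings(term, postings):
    # Collect all (doc_id, tf) pairs into one sorted postings list.
    return (term, sorted(postings))

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
grouped = defaultdict(list)          # stands in for shuffle & sort
for doc_id, text in docs.items():
    for term, posting in map_postings(doc_id, text):
        grouped[term].append(posting)
index = dict(reduce_postings(t, p) for t, p in grouped.items())
# index["B"] == [("d1", 1), ("d2", 1), ("d3", 2)]
```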
Space-saving techniques
• Group by document ID, not pairs
• Golomb compression for postings (applied to individual postings and to lists of postings)
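The slide names Golomb compression of postings; as a sketch, here is the Rice special case (a Golomb code whose parameter is a power of two, M = 2^k), applied to the gaps between sorted document IDs. The parameter k and the gap values are illustrative, not taken from the paper.

```python
def rice_encode(n, k):
    # Rice code (Golomb with M = 2**k): quotient n >> k in unary
    # (q ones then a zero), remainder in exactly k binary digits.
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

# Postings store d-gaps (differences between consecutive doc IDs),
# which are small and compress well. Example gaps are made up.
gaps = [3, 1, 7]
encoded = "".join(rice_encode(g, 2) for g in gaps)
# rice_encode(7, 2) == "1011"  (quotient 1 -> "10", remainder 3 -> "11")
```

Small gaps yield short codewords, which is why gap encoding plus Golomb/Rice coding shrinks postings lists.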
Pairwise document similarity (Map Phase)
Postings lists:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>

Map 1 (term A): <(d1,d3),2>
Map 2 (term B): <(d1,d2),1> <(d1,d3),2> <(d2,d3),2>
Pairwise document similarity (Reduce Phase)
Map output:
<(d1,d3),2>
<(d1,d2),1> <(d1,d3),2> <(d2,d3),2>

After shuffle & sort, Reduce 1–3 sum the partial contributions:
<(d1,d2),[1]> → <(d1,d2),1>
<(d2,d3),[2]> → <(d2,d3),2>
<(d1,d3),[2,2]> → <(d1,d3),4>
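The pairwise step above can be sketched as follows: for each term, the mapper emits one partial similarity per pair of documents in its postings list, and the reducer sums the contributions per pair. To match the slide's numbers the weights here are raw term frequencies; the actual experiments use BM25 weights.

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(term, postings):
    # For each pair of documents sharing this term, emit the product
    # of their term weights as a partial similarity contribution.
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield ((d1, d2), w1 * w2)

def reduce_pairs(pair, contributions):
    # Sum contributions across all shared terms.
    return (pair, sum(contributions))

index = {"A": [("d1", 2), ("d3", 1)],
         "B": [("d1", 1), ("d2", 1), ("d3", 2)],
         "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)]}
grouped = defaultdict(list)          # stands in for shuffle & sort
for term, postings in index.items():
    for pair, w in map_pairs(term, postings):
        grouped[pair].append(w)
sims = dict(reduce_pairs(p, c) for p, c in grouped.items())
# sims[("d1", "d3")] == 2*1 + 1*2 == 4, matching the worked example
```

Note that only document pairs sharing at least one term are ever generated, which is what makes the inner product computable from the inverted index alone.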
Experimental Setup
• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB, 906k documents

Procedure
• Tokenization
• Stop word removal
• Stemming
• Df-cut
  • The fraction of terms with the highest document frequency is eliminated – 99% cut (9,093 terms)
  • 3.7 billion intermediate pairs vs. 8.1 trillion without the cut
• Linear space and time complexity
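The df-cut can be sketched as a filter on the index before the pairwise step. The exact cutoff semantics here (keep the lowest-df fraction of the vocabulary, drop the highest-df remainder, so a 99% cut keeps 99% of terms) is an assumption read off the slide; the function name is illustrative.

```python
def apply_df_cut(index, keep_fraction=0.99):
    # Rank terms by document frequency (postings-list length) and
    # drop the highest-df fraction of the vocabulary.
    terms = sorted(index, key=lambda t: len(index[t]))
    kept = terms[: int(len(terms) * keep_fraction)]
    return {t: index[t] for t in kept}

# Toy index: "the" has the highest df and is pruned first.
toy = {"the": [("d1", 3), ("d2", 2), ("d3", 4)],
       "arthritis": [("d1", 1)],
       "hadoop": [("d2", 1), ("d3", 1)]}
pruned = apply_df_cut(toy, keep_fraction=0.67)
# "the" is eliminated; "arthritis" and "hadoop" remain
```

Since high-df terms contribute quadratically many pairs, removing them is what brings the 8.1 trillion potential pairs down to 3.7 billion.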
[Figure: Running time of pairwise similarity comparisons]
[Figure: Effect of df-cut on the number of intermediate pairs]
Observations
• Complexity: O(n²)
• A df-cut of 99 percent eliminates meaning-bearing terms (e.g., Cornell, arthritis) along with some irrelevant terms (e.g., sleek, frail)
• The df-cut can be relaxed to 99.9 percent
Discussion
• The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
• Df-cut: does a df-cut of 99 percent significantly affect the quality of the results?
• The results have not been evaluated.
Thank you