Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Douglas Oard
Niveda Krishnamoorthy
Overview
• Pairwise Similarity
• MapReduce Framework
• Proposed algorithm
  • Inverted index construction
  • Pairwise document similarity calculation
• Results
Pairwise Similarity of Documents
• PubMed – “More like this”
• Similar blog posts
• Google – Similar pages
MapReduce
• Programming framework that supports distributed computing on clusters of computers
• Introduced by Google in 2004
• Map step
• Reduce step
• Combine step (optional)
• Applications
MapReduce Model
Example – Word Frequency
Consider two files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop

Expected output:
Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1
Map Phase
Map 1 (Hello World Bye World):
<Hello,1> <World,1> <Bye,1> <World,1>
Map 2 (Hello Hadoop Goodbye Hadoop):
<Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>
Reduce Phase
Map output:
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

After shuffle & sort:
<Hello,(1,1)>
<World,(1,1)>
<Bye,(1)>
<Hadoop,(1,1)>
<Goodbye,(1)>

Reduce 1–5 output:
Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1
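The word-frequency walkthrough above can be sketched as a minimal in-memory map/shuffle/reduce. This is illustrative only: the function names are invented here, and a real Hadoop job would implement Mapper and Reducer classes, with the framework performing the shuffle.

```python
from collections import defaultdict

def map_word_count(document):
    # Map step: emit <word, 1> for every token in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(mapped):
    # Group values by key, as the framework's shuffle & sort step does.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_word_count(word, counts):
    # Reduce step: sum the partial counts for one word.
    return (word, sum(counts))

docs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
pairs = [kv for d in docs for kv in map_word_count(d)]
result = dict(reduce_word_count(w, c) for w, c in shuffle(pairs).items())
# result == {"Hello": 2, "World": 2, "Bye": 1, "Hadoop": 2, "Goodbye": 1}
```

An optional combiner would apply the same summing locally on each mapper's output before the shuffle, reducing network traffic.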
Pairwise Document Similarity
MapReduce algorithm – scalable and efficient:
• Inverted index computation
• Pairwise similarity
Constructing Inverted Index (Map Phase)
Document 1 (A A B C) → Map 1:
<A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
Document 2 (B D D) → Map 2:
<B,(d2,1)> <D,(d2,2)>
Document 3 (A B B E) → Map 3:
<A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
Constructing Inverted Index (Reduce Phase)
Map output:
<A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
<B,(d2,1)> <D,(d2,2)>
<A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

After shuffle & sort, Reduce 1–5 emit the postings lists:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>
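The inverted-index construction above can be sketched the same way: each mapper emits <term, (doc_id, tf)> postings, and each reducer collects all postings for one term into a list. The shuffle is simulated with a dictionary; function names are illustrative.

```python
from collections import defaultdict

def map_postings(doc_id, text):
    # Count term frequencies in one document, then emit
    # <term, (doc_id, tf)> for each distinct term.
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    for term, freq in tf.items():
        yield (term, (doc_id, freq))

def reduce_postings(term, postings):
    # Collect all (doc_id, tf) pairs into one sorted postings list.
    return (term, sorted(postings))

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
grouped = defaultdict(list)          # stands in for shuffle & sort
for doc_id, text in docs.items():
    for term, posting in map_postings(doc_id, text):
        grouped[term].append(posting)
index = dict(reduce_postings(t, p) for t, p in grouped.items())
# index["B"] == [("d1", 1), ("d2", 1), ("d3", 2)]
```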
Space-saving techniques
• Group by document ID, not pairs
• Golomb compression for postings (applied to individual postings and to lists of postings)
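The slide names Golomb compression of postings; as a sketch, here is the Rice special case (a Golomb code whose parameter is a power of two, M = 2^k), applied to the gaps between sorted document IDs. The parameter k and the gap values are illustrative, not taken from the paper.

```python
def rice_encode(n, k):
    # Rice code (Golomb with M = 2**k): quotient n >> k in unary
    # (q ones then a zero), remainder in exactly k binary digits.
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

# Postings store d-gaps (differences between consecutive doc IDs),
# which are small and compress well. Example gaps are made up.
gaps = [3, 1, 7]
encoded = "".join(rice_encode(g, 2) for g in gaps)
# rice_encode(7, 2) == "1011"  (quotient 1 -> "10", remainder 3 -> "11")
```

Small gaps yield short codewords, which is why gap encoding plus Golomb/Rice coding shrinks postings lists.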
Pairwise document similarity (Map Phase)
Postings lists:
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>

Map 1 (term A): <(d1,d3),2>
Map 2 (term B): <(d1,d2),1> <(d1,d3),2> <(d2,d3),2>
Pairwise document similarity (Reduce Phase)
Map output:
<(d1,d3),2>
<(d1,d2),1> <(d1,d3),2> <(d2,d3),2>

After shuffle & sort, Reduce 1–3 sum the partial contributions:
<(d1,d2),[1]> → <(d1,d2),1>
<(d2,d3),[2]> → <(d2,d3),2>
<(d1,d3),[2,2]> → <(d1,d3),4>
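The pairwise step above can be sketched as follows: for each term, the mapper emits one partial similarity per pair of documents in its postings list, and the reducer sums the contributions per pair. To match the slide's numbers the weights here are raw term frequencies; the actual experiments use BM25 weights.

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(term, postings):
    # For each pair of documents sharing this term, emit the product
    # of their term weights as a partial similarity contribution.
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield ((d1, d2), w1 * w2)

def reduce_pairs(pair, contributions):
    # Sum contributions across all shared terms.
    return (pair, sum(contributions))

index = {"A": [("d1", 2), ("d3", 1)],
         "B": [("d1", 1), ("d2", 1), ("d3", 2)],
         "C": [("d1", 1)], "D": [("d2", 2)], "E": [("d3", 1)]}
grouped = defaultdict(list)          # stands in for shuffle & sort
for term, postings in index.items():
    for pair, w in map_pairs(term, postings):
        grouped[pair].append(w)
sims = dict(reduce_pairs(p, c) for p, c in grouped.items())
# sims[("d1", "d3")] == 2*1 + 1*2 == 4, matching the worked example
```

Note that only document pairs sharing at least one term are ever generated, which is what makes the inner product computable from the inverted index alone.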
Experimental Setup
• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB, 906k documents

Procedure
• Tokenization
• Stop word removal
• Stemming
• Df-cut
  • The fraction of terms with the highest document frequency is eliminated – 99% cut (9,093 terms)
  • 3.7 billion intermediate pairs vs. 8.1 trillion without the cut
• Linear space and time complexity
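The df-cut can be sketched as a filter on the index before the pairwise step. The exact cutoff semantics here (keep the lowest-df fraction of the vocabulary, drop the highest-df remainder, so a 99% cut keeps 99% of terms) is an assumption read off the slide; the function name is illustrative.

```python
def apply_df_cut(index, keep_fraction=0.99):
    # Rank terms by document frequency (postings-list length) and
    # drop the highest-df fraction of the vocabulary.
    terms = sorted(index, key=lambda t: len(index[t]))
    kept = terms[: int(len(terms) * keep_fraction)]
    return {t: index[t] for t in kept}

# Toy index: "the" has the highest df and is pruned first.
toy = {"the": [("d1", 3), ("d2", 2), ("d3", 4)],
       "arthritis": [("d1", 1)],
       "hadoop": [("d2", 1), ("d3", 1)]}
pruned = apply_df_cut(toy, keep_fraction=0.67)
# "the" is eliminated; "arthritis" and "hadoop" remain
```

Since high-df terms contribute quadratically many pairs, removing them is what brings the 8.1 trillion potential pairs down to 3.7 billion.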
[Figure: Running time of pairwise similarity comparisons]
[Figure: Effect of df-cut on the number of intermediate pairs]
Observations
• Complexity: O(n²)
• A df-cut of 99 percent eliminates meaning-bearing terms (e.g., Cornell, arthritis) along with some irrelevant terms (e.g., sleek, frail)
• The df-cut can be relaxed to 99.9 percent
Discussion
• The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
• Df-cut: does a df-cut of 99 percent significantly affect the quality of the results?
• The results have not been evaluated.
Thank you