Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas Oard Niveda Krishnamoorthy


Page 1: Pairwise document similarity in large collections with map reduce

Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas Oard

Niveda Krishnamoorthy

Page 2: Pairwise document similarity in large collections with map reduce

Overview

• Pairwise Similarity
• MapReduce Framework
• Proposed algorithm
  • Inverted Index Construction
  • Pairwise document similarity calculation
• Results

Page 3: Pairwise document similarity in large collections with map reduce

Pairwise Similarity of Documents

• PubMed – “More like this”
• Similar blog posts
• Google – Similar pages

Page 4: Pairwise document similarity in large collections with map reduce

MapReduce

• Programming framework that supports distributed computing on clusters of computers
• Introduced by Google in 2004
• Map step
• Reduce step
• Combine step (optional)
• Applications
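The optional combine step can be pictured as a local reduce that runs on each mapper's output before anything is sent across the network. A minimal Python sketch, assuming a word-count job; the function names `map_words` and `combine` are illustrative and not part of any MapReduce API:

```python
from collections import Counter

def map_words(text):
    """Map step: emit one (word, 1) pair per token."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combine step: sum counts locally, per mapper, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

mapped = map_words("Hello World Bye World")
combined = combine(mapped)
print(len(mapped), "pairs before combining,", len(combined), "after")
```

The combiner does not change the final result; it only shrinks the amount of intermediate data that has to be shuffled.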

Page 5: Pairwise document similarity in large collections with map reduce

MapReduce Model

Page 6: Pairwise document similarity in large collections with map reduce

Example – Word Frequency

Consider two files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop

Output:

Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1

Page 7: Pairwise document similarity in large collections with map reduce

Map Phase

Map 1
Input: Hello World Bye World
Output: <Hello,1> <World,1> <Bye,1> <World,1>

Map 2
Input: Hello Hadoop Goodbye Hadoop
Output: <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Page 8: Pairwise document similarity in large collections with map reduce

Reduce Phase

Map output:
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Shuffle & Sort:
<Hello,(1,1)> <World,(1,1)> <Bye,(1)> <Hadoop,(1,1)> <Goodbye,(1)>

Reduce 1 to Reduce 5 (one reducer per key):
Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1
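The word-frequency walk-through can be reproduced in a few lines. This is a single-process sketch that imitates the three stages, not Hadoop code; the helper names `map_phase`, `shuffle_and_sort`, and `reduce_phase` are chosen here for illustration:

```python
from collections import defaultdict

def map_phase(text):
    """Emit <word, 1> for every token in one input file."""
    return [(word, 1) for word in text.split()]

def shuffle_and_sort(pairs):
    """Group all values that share a key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Sum the list of ones collected for a single word."""
    return key, sum(values)

files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
intermediate = [pair for text in files for pair in map_phase(text)]
grouped = shuffle_and_sort(intermediate)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

In a real cluster each call to `map_phase` and `reduce_phase` would run on a different machine; only the shuffle moves data between them.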

Page 9: Pairwise document similarity in large collections with map reduce

Pairwise Document Similarity

MapReduce algorithm

• Inverted Index Computation
• Pairwise Similarity

Scalable and efficient
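The quantity computed for each pair is not spelled out on this slide. Assuming each term t carries a weight w in each document that contains it (raw term frequency in the toy example on the later slides), the worked example there is consistent with an inner product over shared terms:

```latex
\mathrm{sim}(d_i, d_j) = \sum_{t \,\in\, d_i \cap d_j} w_{t,d_i} \, w_{t,d_j}
```

Each term contributes to the sum independently, which is what allows the partial products to be generated term by term in the mappers and added up per document pair in the reducers.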

Page 10: Pairwise document similarity in large collections with map reduce

Constructing Inverted Index (Map Phase)

Map 1
Input: Document 1: A A B C
Output: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>

Map 2
Input: Document 2: B D D
Output: <B,(d2,1)> <D,(d2,2)>

Map 3
Input: Document 3: A B B E
Output: <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Page 11: Pairwise document similarity in large collections with map reduce

Constructing Inverted Index (Reduce Phase)

Map output:
<A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
<B,(d2,1)> <D,(d2,2)>
<A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Shuffle & Sort, then Reduce 1 to Reduce 5 (one reducer per term):
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>
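The index construction above can be sketched in a single process as follows. The slides do not give the exact algorithm, so this is only one plausible reading: the mapper emits a term-frequency posting per distinct term and the reducer collects postings into a list. The names `map_document` and `build_index` are illustrative.

```python
from collections import Counter, defaultdict

def map_document(doc_id, text):
    """Map: emit <term, (doc_id, term frequency)> for each distinct term."""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def build_index(docs):
    """Shuffle and reduce: gather every posting for a term into one list."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_document(doc_id, text):
            postings[term].append(posting)
    return {term: sorted(plist) for term, plist in sorted(postings.items())}

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
index = build_index(docs)
for term, plist in index.items():
    print(term, plist)
```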

Page 12: Pairwise document similarity in large collections with map reduce

Space saving technique

• Group by document ID, not pairs
• Golomb’s compression for postings
  • Individual postings
  • List of postings
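To make the compression bullet concrete, here is a minimal sketch of Golomb coding applied to a postings list. Several things are assumptions not stated on the slide: document IDs are positive integers stored in increasing order, the encoder works on the gaps between consecutive IDs, and the parameter `b = 3` is arbitrary. Bits are kept as a string for readability rather than packed into bytes.

```python
def golomb_encode(x, b):
    """Encode a positive integer x with Golomb parameter b as a bit string."""
    q, r = divmod(x - 1, b)
    bits = "1" * q + "0"              # quotient in unary
    c = (b - 1).bit_length()          # ceil(log2(b))
    cutoff = (1 << c) - b             # remainders below this use c - 1 bits
    width, value = (c - 1, r) if r < cutoff else (c, r + cutoff)
    if width > 0:
        bits += format(value, "0{}b".format(width))
    return bits

def golomb_decode(bits, b):
    """Decode a concatenation of Golomb codes back into integers."""
    c = (b - 1).bit_length()
    cutoff = (1 << c) - b
    out, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":         # read the unary quotient
            q, i = q + 1, i + 1
        i += 1                        # skip the terminating 0
        r = 0
        if c > 0:
            r = int(bits[i:i + c - 1] or "0", 2)
            i += c - 1
            if r >= cutoff:           # long form: one more bit follows
                r = r * 2 + int(bits[i]) - cutoff
                i += 1
        out.append(q * b + r + 1)
    return out

doc_ids = [3, 7, 8, 15]                                   # hypothetical postings
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = "".join(golomb_encode(g, 3) for g in gaps)
decoded = golomb_decode(encoded, 3)
```

Small gaps get short codes, so the scheme pays off most on long postings lists, where consecutive document IDs tend to be close together.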

Page 13: Pairwise document similarity in large collections with map reduce

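Pairwise document similarity (Map Phase)

Input (inverted index):
<A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>

Map 1
Input: <A,[(d1,2),(d3,1)]>
Output: <(d1,d3),2>

Map 2
Input: <B,[(d1,1),(d2,1),(d3,2)]>
Output: <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>

C, D, and E each occur in a single document, so they produce no pairs.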

Page 14: Pairwise document similarity in large collections with map reduce

Pairwise document similarity (Reduce Phase)

Map output:
<(d1,d3),2>
<(d1,d2),1> <(d2,d3),2> <(d1,d3),2>

Shuffle & Sort:
<(d1,d2),[1]>
<(d2,d3),[2]>
<(d1,d3),[2,2]>

Reduce 1 to Reduce 3 (one reducer per document pair):
<(d1,d2),[1]>
<(d2,d3),[2]>
<(d1,d3),[4]>
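The two phases above can be checked with a short single-process sketch. It takes the inverted index from the earlier slides as input and uses raw term frequencies as weights, as the example does; the function names are illustrative, and the slides do not specify the authors' actual implementation.

```python
from collections import defaultdict
from itertools import combinations

def map_postings(postings):
    """Map: for one term, emit <(di, dj), wi * wj> for every document pair."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairwise_similarity(index):
    """Shuffle by document pair, then reduce by summing partial products."""
    grouped = defaultdict(list)
    for postings in index.values():
        for pair, product in map_postings(postings):
            grouped[pair].append(product)
    return {pair: sum(products) for pair, products in sorted(grouped.items())}

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
sims = pairwise_similarity(index)
print(sims)
```

Only pairs that share at least one term are ever generated, so documents with nothing in common cost nothing.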

Page 15: Pairwise document similarity in large collections with map reduce

Experimental Setup

• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents
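The slide names BM25 but gives neither the variant nor the parameters. As a rough illustration only, the sketch below computes one common form of the BM25 term weight; the defaults `k1 = 1.2` and `b = 0.75` and the smoothed IDF are conventional choices assumed here, not values taken from the experiments.

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """One common BM25 term weight (assumed variant, not the slide's)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * doc_len / avg_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Hypothetical numbers: a term seen twice in an average-length document,
# occurring in 100 of 906,000 documents.
weight = bm25_weight(tf=2, df=100, n_docs=906_000, doc_len=300, avg_len=300)
print(round(weight, 3))
```

Weights of this kind would take the place of the raw term frequencies stored in the postings of the toy example.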

Page 16: Pairwise document similarity in large collections with map reduce

Procedure

• Tokenization
• Stop word removal
• Stemming
• Df-cut
  • The fraction of terms with the highest document frequency is eliminated – 99% cut (9,093 terms)
  • 3.7 billion pairs vs. 8.1 trillion pairs
• Linear space and time complexity
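Why the df-cut helps so much can be seen by counting intermediate pairs: a term that occurs in df documents produces df(df − 1)/2 pairs, so the most frequent terms dominate the total. A small sketch on the toy index from the earlier slides, where `keep_percent` is a hypothetical parameter name for the cut level:

```python
def pairs_generated(index):
    """A term with df postings yields df * (df - 1) / 2 intermediate pairs."""
    return sum(len(p) * (len(p) - 1) // 2 for p in index.values())

def apply_df_cut(index, keep_percent):
    """Keep the keep_percent% of terms with the lowest document frequency."""
    by_df = sorted(index, key=lambda term: len(index[term]))
    n_keep = len(by_df) * keep_percent // 100
    return {term: index[term] for term in by_df[:n_keep]}

index = {
    "A": ["d1", "d3"],
    "B": ["d1", "d2", "d3"],
    "C": ["d1"],
    "D": ["d2"],
    "E": ["d3"],
}
cut = apply_df_cut(index, 80)   # drops the single most frequent term, B
print(pairs_generated(index), "->", pairs_generated(cut))
```

Dropping one term out of five removes three of the four intermediate pairs in this toy case.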

Page 17: Pairwise document similarity in large collections with map reduce

Running Time of Pairwise Similarity Comparisons

Page 18: Pairwise document similarity in large collections with map reduce

Effect of df-cut on number of Intermediate pairs

Page 19: Pairwise document similarity in large collections with map reduce

Observations

• Complexity: O(n²)
• A df-cut of 99 percent eliminates meaning-bearing terms and some irrelevant terms
  • Cornell, arthritis
  • sleek, frail
• The df-cut can be relaxed to 99.9 percent

Page 20: Pairwise document similarity in large collections with map reduce

Discussion

• The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
• Df-cut: does a df-cut of 99 percent affect the quality of the results significantly?
• The results have not been evaluated.

Page 21: Pairwise document similarity in large collections with map reduce

Thank you