
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING

Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma

Presentation by: Fernando Arreola


Outline

- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions


De-duplication

- The process of eliminating near-duplicate web documents in a generic crawl
- Identifying exact duplicates is easy: use checksums
- The challenge is identifying near-duplicates: documents that are identical in content but differ in small areas such as ads, counters, and timestamps


Goal of the Paper

- Present a near-duplicate detection system that improves web crawling
- The system comprises:
  - The simhash technique, which transforms a web page into an f-bit fingerprint
  - A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions


Why is De-duplication Important?

Eliminating near-duplicates:
- Saves network bandwidth: content similar to previously crawled content does not have to be crawled
- Reduces storage cost: such content does not have to be stored in the local repository
- Improves the quality of search indexes: the local repository used for building search indexes is not polluted by near-duplicates


Algorithm: Simhash Technique

1. Convert the web page into a set of features using information-retrieval techniques (e.g. tokenization, phrase detection)
2. Assign a weight to each feature
3. Hash each feature into an f-bit value
4. Maintain an f-dimensional vector whose components all start at 0
5. For each feature, update the vector: if the i-th bit of the feature's hash value is 1, add the feature's weight to the i-th component; if it is 0, subtract the feature's weight from the i-th component
6. The final vector has positive and negative components; the sign (+/-) of each component gives one bit of the fingerprint (see the sketch below)
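A minimal Python sketch of this procedure, under stated assumptions: the feature extraction and weighting are left abstract (a plain feature-to-weight mapping stands in), MD5 truncated to f bits stands in for the per-feature hash (the paper does not prescribe one), and LSB-first bit indexing is an arbitrary implementation choice.

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash fingerprint from weighted features.

    `features` maps a feature string to its weight. MD5 here is an
    illustrative choice of per-feature hash, not the paper's.
    """
    v = [0] * f  # f-dimensional vector, all components start at 0
    for feature, weight in features.items():
        # Hash the feature to an f-bit value.
        h = int.from_bytes(hashlib.md5(feature.encode()).digest(), "big") % (1 << f)
        for i in range(f):
            # i-th bit set -> add the weight; i-th bit clear -> subtract it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Positive component -> fingerprint bit 1; otherwise bit 0.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

# Example call, using the weights from the slides' toy example:
print(hex(simhash({"Simhash": 2, "Technique": 4})))
```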


Algorithm: Simhash Technique (cont.)

A very simple example:
- One web page, with text: "Simhash Technique"
- Reduced to two features: "Simhash" (weight = 2) and "Technique" (weight = 4)
- Features hashed to 4 bits: "Simhash" -> 1101, "Technique" -> 0110


Algorithm: Simhash Technique (cont.)

Start the vector with all zeroes: [0, 0, 0, 0]


Algorithm: Simhash Technique (cont.)

Apply the "Simhash" feature (weight = 2):

Feature's f-bit value:  1      1      0      1
Calculation:            0 + 2  0 + 2  0 - 2  0 + 2
Updated vector:         2      2      -2     2


Algorithm: Simhash Technique (cont.)

Apply the "Technique" feature (weight = 4):

Feature's f-bit value:  0      1      1       0
Calculation:            2 - 4  2 + 4  -2 + 4  2 - 4
Updated vector:         -2     6      2       -2


Algorithm: Simhash Technique (cont.)

Final vector: [-2, 6, 2, -2]
Signs of the components: -, +, +, -
Final 4-bit fingerprint: 0110
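The worked example can be checked with a tiny variant of the earlier sketch that substitutes the slides' fixed 4-bit hashes for a real hash function (`simhash_fixed` is an illustrative name, not from the paper):

```python
def simhash_fixed(features, hashes, f=4):
    """Simhash with hard-coded f-bit hashes, mirroring the slides' example."""
    v = [0] * f
    for feature, weight in features.items():
        bits = hashes[feature]  # e.g. "1101", most significant bit first
        for i, b in enumerate(bits):
            v[i] += weight if b == "1" else -weight
    # Signs of the components become the fingerprint bits.
    return "".join("1" if x > 0 else "0" for x in v)

# Vector ends at [-2, 6, 2, -2] -> fingerprint "0110", as on the slide.
print(simhash_fixed({"Simhash": 2, "Technique": 4},
                    {"Simhash": "1101", "Technique": "0110"}))  # -> 0110
```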


Algorithm: Solution to Hamming Distance Problem

Problem: given an f-bit fingerprint F, find all fingerprints in a given collection that differ from F in at most k bit positions.

Solution:
- Create tables containing the fingerprints; each table has a permutation (π) and a small integer (p) associated with it
- Apply the permutation associated with each table to its fingerprints, then sort the tables
- Store the tables in the main memory of a set of machines and iterate through them in parallel
- In table i, find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
- For the fingerprints that matched, check whether they differ from π_i(F) in at most k bits (see the sketch below)
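A minimal sketch of one table's build-and-probe logic, under assumptions: `PermutedTable` is an illustrative name, and a bit rotation stands in for the arbitrary bit permutations the approach allows.

```python
from bisect import bisect_left

def popcount(x):
    return bin(x).count("1")

class PermutedTable:
    """One sorted table of permuted f-bit fingerprints."""

    def __init__(self, fingerprints, r, p, f=64):
        self.r, self.p, self.f = r, p, f
        self.rows = sorted(self.permute(fp) for fp in fingerprints)

    def permute(self, fp):
        # Rotate left by r bits within an f-bit word (stand-in permutation).
        return ((fp << self.r) | (fp >> (self.f - self.r))) & ((1 << self.f) - 1)

    def probe(self, F, k):
        """Yield stored permuted fingerprints within Hamming distance k of pi(F)."""
        pf = self.permute(F)
        top = pf >> (self.f - self.p)                       # top p bits of pi(F)
        lo = bisect_left(self.rows, top << (self.f - self.p))
        for row in self.rows[lo:]:
            if row >> (self.f - self.p) != top:             # left the matching run
                break
            if popcount(row ^ pf) <= k:                     # full distance check
                yield row

# Table 1 of the example below: rotating by 4 equals "swap halves".
t1 = PermutedTable([0b11000101, 0b11111111, 0b01011100, 0b01111110],
                   r=4, p=3, f=8)
# Yields the permuted row 0b11000101 (invert the rotation to recover the original).
print([bin(m) for m in t1.probe(0b01001101, k=3)])
```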


Algorithm: Solution to Hamming Distance Problem (cont.)

A simple example: F = 0100 1101, k = 3, a collection of 8 fingerprints, and two tables.

Fingerprints:
1100 0101
1111 1111
0101 1100
0111 1110
1111 1110
0010 0001
1111 0101
1101 0010


Algorithm: Solution to Hamming Distance Problem (cont.)

Table 1 holds the first four fingerprints; p = 3 and π = swap the last four bits with the first four bits:

1100 0101 -> 0101 1100
1111 1111 -> 1111 1111
0101 1100 -> 1100 0101
0111 1110 -> 1110 0111

Table 2 holds the last four fingerprints; p = 3 and π = move the last two bits to the front:

1111 1110 -> 1011 1111
0010 0001 -> 0100 1000
1111 0101 -> 0111 1101
1101 0010 -> 1011 0100


Algorithm: Solution to Hamming Distance Problem (cont.)

Sort each table.

Table 1 (p = 3; π = swap last four bits with first four bits), sorted:
0101 1100
1100 0101
1110 0111
1111 1111

Table 2 (p = 3; π = move last two bits to the front), sorted:
0100 1000
0111 1101
1011 0100
1011 1111


Algorithm: Solution to Hamming Distance Problem (cont.)

Query: F = 0100 1101, so π1(F) = 1101 0100 and π2(F) = 0101 0011.

Table 1 (p = 3; π = swap last four bits with first four bits):
0101 1100
1100 0101  <- top 3 bits match π1(F)
1110 0111
1111 1111

Table 2 (p = 3; π = move last two bits to the front):
0100 1000  <- top 3 bits match π2(F)
0111 1101
1011 0100
1011 1111

Each table yields one candidate whose top p = 3 bits match.


Algorithm: Solution to Hamming Distance Problem (cont.)

With k = 3, only the candidate in the first table is a near-duplicate of F:

Table 1: π1(F) = 1101 0100 vs. 1100 0101 -> differ in 2 bit positions (<= k)
Table 2: π2(F) = 0101 0011 vs. 0100 1000 -> differ in 4 bit positions (> k)
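The two distance checks can be reproduced directly with an XOR and a popcount:

```python
# Verify the example's two candidate checks (f = 8, k = 3).
pi1_F, cand1 = 0b11010100, 0b11000101   # table 1: permuted F vs. its candidate
pi2_F, cand2 = 0b01010011, 0b01001000   # table 2: permuted F vs. its candidate
print(bin(pi1_F ^ cand1).count("1"))    # 2 -> within k: near-duplicate
print(bin(pi2_F ^ cand2).count("1"))    # 4 -> exceeds k: not a near-duplicate
```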


Algorithm: Compression of Tables

1. Store the first fingerprint of a block (1024 bytes) uncompressed
2. XOR the current fingerprint with the previous one
3. Append to the block the Huffman code for the position of the XOR's most significant 1-bit
4. Append to the block the bits after that most significant 1-bit
5. Repeat steps 2-4 until the block is full

To compare against a query fingerprint, use the last fingerprint (the key) of each block and perform interpolation search over the keys to locate and decompress the appropriate block. A sketch of the delta-encoding step follows.
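A hedged sketch of the per-block delta encoding, assuming distinct, sorted fingerprints and writing the raw bit position where the paper would emit a Huffman code:

```python
def delta_encode(sorted_fps):
    """Delta-encode one block of sorted, distinct fingerprints.

    Returns the first fingerprint uncompressed plus, for each later
    fingerprint, (msb position of the XOR, the bits after it). The paper
    Huffman-codes the msb position; here it is left as a plain int.
    """
    first, prev, codes = sorted_fps[0], sorted_fps[0], []
    for fp in sorted_fps[1:]:
        x = fp ^ prev                     # XOR with the previous fingerprint
        msb = x.bit_length() - 1          # position of the most significant 1-bit
        codes.append((msb, x & ((1 << msb) - 1)))  # bits after that 1-bit
        prev = fp
    return first, codes

def delta_decode(first, codes):
    """Invert delta_encode: rebuild the block's fingerprints."""
    fps = [first]
    for msb, rest in codes:
        fps.append(fps[-1] ^ ((1 << msb) | rest))
    return fps

# Round-trip check on a toy block:
first, codes = delta_encode([0x12, 0x15, 0x4F])
assert delta_decode(first, codes) == [0x12, 0x15, 0x4F]
```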


Algorithm: Extending to Batch Queries

Problem: we want near-duplicates for a whole batch of query fingerprints, not just one.

Solution: use the Google File System (GFS) and MapReduce.
- Create two files: file F holds the collection of fingerprints, and file Q holds the query fingerprints
- Store both files in GFS, which breaks them into chunks
- Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q: MapReduce creates one task per chunk, the chunks are processed in parallel, and each task outputs the near-duplicates it finds
- Produce a sorted file from the outputs of all tasks, removing duplicates if necessary (a toy sketch follows)
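A toy, single-process stand-in for this job: `map_chunk` and `reduce_outputs` are illustrative names, a linear scan replaces the permuted-table probe, and all GFS/MapReduce plumbing is elided.

```python
import heapq

def map_chunk(chunk, queries, k):
    """Map task: near-duplicate pairs between one chunk of F and all of Q."""
    out = []
    for q in queries:
        for fp in chunk:
            if bin(q ^ fp).count("1") <= k:   # Hamming distance at most k
                out.append((q, fp))
    return sorted(out)

def reduce_outputs(task_outputs):
    """Merge the sorted per-task outputs into one sorted, de-duplicated list."""
    merged = []
    for pair in heapq.merge(*task_outputs):
        if not merged or merged[-1] != pair:
            merged.append(pair)
    return merged

# Two "chunks" of F probed against the same queries, then merged:
queries = [0b01001101]
outs = [map_chunk([0b01011100, 0b11111111], queries, 3),
        map_chunk([0b01001000, 0b11010010], queries, 3)]
print(reduce_outputs(outs))
```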


Experiment: Parameters

- 8 billion web pages used
- k = 1 ... 10
- Pairs manually tagged as follows:
  - True positives: pairs that differ only slightly
  - False positives: radically different pairs
  - Unknown: pairs that could not be evaluated


Experiment: Results

Accuracy:
- A low k value yields many false negatives; a high k value yields many false positives
- Best value: k = 3, where 75% of near-duplicates are reported and 75% of reported cases are true positives

Running time:
- Solution to the Hamming Distance Problem: O(log(p))
- Batch query + compression: a 32 GB file with 200 tasks runs in under 100 seconds


Related Work

- Clustering related documents: detect near-duplicates to show related pages
- Data extraction: determine the schema of similar pages to obtain information
- Plagiarism: detect pages that have borrowed from each other
- Spam: detect spam before the user receives it


Tying it Back to Lecture

Similarities:
- Both stress the importance of de-duplication for saving crawler resources
- Both briefly summarize several uses for near-duplicate detection

Differences:
- Lecture focus: a breadth-first look at algorithms for near-duplicate detection
- Paper focus: an in-depth look at the simhash and Hamming Distance algorithms, including how to implement them and how effective they are


Paper Evaluation: Pros

- Thorough step-by-step explanation of the algorithm implementation
- Thorough explanation of how the conclusions were reached
- Brief description of how to improve the simhash + Hamming Distance algorithm: categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.


Paper Evaluation: Cons

- No comparison: how much more effective or faster is it than other algorithms? By how much did it improve the crawler?
- Batch queries are tied to a specific technology: the implementation requires GFS; an approach not restricted to a particular technology might be more widely applicable


Any Questions?

???