
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING

Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma

Presentation by: Fernando Arreola


Outline

- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions


De-duplication

- The process of eliminating near-duplicate web documents in a generic crawl
- Identifying exact duplicates is easy: use checksums
- The challenge is identifying near-duplicates: documents that are identical in content but differ in small areas such as ads, counters, and timestamps


Goal of the Paper

- Present a near-duplicate detection system that improves web crawling
- The system comprises:
  - The simhash technique, which transforms a web page into an f-bit fingerprint
  - A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions


Why is De-duplication Important?

Eliminating near-duplicates:
- Saves network bandwidth: content similar to previously crawled content does not have to be crawled
- Reduces storage cost: such content does not have to be stored in the local repository
- Improves the quality of search indexes: the local repository used for building search indexes is not polluted by near-duplicates


Algorithm: Simhash Technique

1. Convert the web page into a set of features using information-retrieval techniques (e.g. tokenization, phrase detection)
2. Assign a weight to each feature
3. Hash each feature into an f-bit value
4. Maintain an f-dimensional vector whose components all start at 0
5. For each feature, update the vector: if the i-th bit of the feature's hash value is 1, add the feature's weight to the i-th component; if it is 0, subtract the feature's weight from the i-th component
6. The final vector has positive and negative components; the sign (+/-) of each component gives one bit of the fingerprint (see the sketch below)
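A minimal Python sketch of this procedure, under stated assumptions: the feature extraction and weighting are left abstract (a plain feature-to-weight mapping stands in), MD5 truncated to f bits stands in for the per-feature hash (the paper does not prescribe one), and LSB-first bit indexing is an arbitrary implementation choice.

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash fingerprint from weighted features.

    `features` maps a feature string to its weight. MD5 here is an
    illustrative choice of per-feature hash, not the paper's.
    """
    v = [0] * f  # f-dimensional vector, all components start at 0
    for feature, weight in features.items():
        # Hash the feature to an f-bit value.
        h = int.from_bytes(hashlib.md5(feature.encode()).digest(), "big") % (1 << f)
        for i in range(f):
            # i-th bit set -> add the weight; i-th bit clear -> subtract it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Positive component -> fingerprint bit 1; otherwise bit 0.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

# Example call, using the weights from the slides' toy example:
print(hex(simhash({"Simhash": 2, "Technique": 4})))
```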


Algorithm: Simhash Technique (cont.)

A very simple example:
- One web page, with text: "Simhash Technique"
- Reduced to two features: "Simhash" (weight = 2) and "Technique" (weight = 4)
- Features hashed to 4 bits: "Simhash" -> 1101, "Technique" -> 0110


Algorithm: Simhash Technique (cont.)

Start the vector with all zeroes: [0, 0, 0, 0]


Algorithm: Simhash Technique (cont.)

Apply the "Simhash" feature (weight = 2):

Feature's f-bit value:  1      1      0      1
Calculation:            0 + 2  0 + 2  0 - 2  0 + 2
Updated vector:         2      2      -2     2


Algorithm: Simhash Technique (cont.)

Apply the "Technique" feature (weight = 4):

Feature's f-bit value:  0      1      1       0
Calculation:            2 - 4  2 + 4  -2 + 4  2 - 4
Updated vector:         -2     6      2       -2


Algorithm: Simhash Technique (cont.)

Final vector: [-2, 6, 2, -2]
Signs of the components: -, +, +, -
Final 4-bit fingerprint: 0110
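The worked example can be checked with a tiny variant of the earlier sketch that substitutes the slides' fixed 4-bit hashes for a real hash function (`simhash_fixed` is an illustrative name, not from the paper):

```python
def simhash_fixed(features, hashes, f=4):
    """Simhash with hard-coded f-bit hashes, mirroring the slides' example."""
    v = [0] * f
    for feature, weight in features.items():
        bits = hashes[feature]  # e.g. "1101", most significant bit first
        for i, b in enumerate(bits):
            v[i] += weight if b == "1" else -weight
    # Signs of the components become the fingerprint bits.
    return "".join("1" if x > 0 else "0" for x in v)

# Vector ends at [-2, 6, 2, -2] -> fingerprint "0110", as on the slide.
print(simhash_fixed({"Simhash": 2, "Technique": 4},
                    {"Simhash": "1101", "Technique": "0110"}))  # -> 0110
```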


Algorithm: Solution to Hamming Distance Problem

Problem: given an f-bit fingerprint F, find all fingerprints in a given collection that differ from F in at most k bit positions.

Solution:
- Create tables containing the fingerprints; each table has a permutation (π) and a small integer (p) associated with it
- Apply the permutation associated with each table to its fingerprints, then sort the tables
- Store the tables in the main memory of a set of machines and iterate through them in parallel
- In table i, find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
- For the fingerprints that matched, check whether they differ from π_i(F) in at most k bits (see the sketch below)
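A minimal sketch of one table's build-and-probe logic, under assumptions: `PermutedTable` is an illustrative name, and a bit rotation stands in for the arbitrary bit permutations the approach allows.

```python
from bisect import bisect_left

def popcount(x):
    return bin(x).count("1")

class PermutedTable:
    """One sorted table of permuted f-bit fingerprints."""

    def __init__(self, fingerprints, r, p, f=64):
        self.r, self.p, self.f = r, p, f
        self.rows = sorted(self.permute(fp) for fp in fingerprints)

    def permute(self, fp):
        # Rotate left by r bits within an f-bit word (stand-in permutation).
        return ((fp << self.r) | (fp >> (self.f - self.r))) & ((1 << self.f) - 1)

    def probe(self, F, k):
        """Yield stored permuted fingerprints within Hamming distance k of pi(F)."""
        pf = self.permute(F)
        top = pf >> (self.f - self.p)                       # top p bits of pi(F)
        lo = bisect_left(self.rows, top << (self.f - self.p))
        for row in self.rows[lo:]:
            if row >> (self.f - self.p) != top:             # left the matching run
                break
            if popcount(row ^ pf) <= k:                     # full distance check
                yield row

# Table 1 of the example below: rotating by 4 equals "swap halves".
t1 = PermutedTable([0b11000101, 0b11111111, 0b01011100, 0b01111110],
                   r=4, p=3, f=8)
# Yields the permuted row 0b11000101 (invert the rotation to recover the original).
print([bin(m) for m in t1.probe(0b01001101, k=3)])
```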


Algorithm: Solution to Hamming Distance Problem (cont.)

A simple example: F = 0100 1101, k = 3, a collection of 8 fingerprints, and two tables.

Fingerprints:
1100 0101
1111 1111
0101 1100
0111 1110
1111 1110
0010 0001
1111 0101
1101 0010


Algorithm: Solution to Hamming Distance Problem (cont.)

Table 1 holds the first four fingerprints; p = 3 and π = swap the last four bits with the first four bits:

1100 0101 -> 0101 1100
1111 1111 -> 1111 1111
0101 1100 -> 1100 0101
0111 1110 -> 1110 0111

Table 2 holds the last four fingerprints; p = 3 and π = move the last two bits to the front:

1111 1110 -> 1011 1111
0010 0001 -> 0100 1000
1111 0101 -> 0111 1101
1101 0010 -> 1011 0100


Algorithm: Solution to Hamming Distance Problem (cont.)

Sort each table.

Table 1 (p = 3; π = swap last four bits with first four bits), sorted:
0101 1100
1100 0101
1110 0111
1111 1111

Table 2 (p = 3; π = move last two bits to the front), sorted:
0100 1000
0111 1101
1011 0100
1011 1111


Algorithm: Solution to Hamming Distance Problem (cont.)

Query: F = 0100 1101, so π1(F) = 1101 0100 and π2(F) = 0101 0011.

Table 1 (p = 3; π = swap last four bits with first four bits):
0101 1100
1100 0101  <- top 3 bits match π1(F)
1110 0111
1111 1111

Table 2 (p = 3; π = move last two bits to the front):
0100 1000  <- top 3 bits match π2(F)
0111 1101
1011 0100
1011 1111

Each table yields one candidate whose top p = 3 bits match.


Algorithm: Solution to Hamming Distance Problem (cont.)

With k = 3, only the candidate in the first table is a near-duplicate of F:

Table 1: π1(F) = 1101 0100 vs. 1100 0101 -> differ in 2 bit positions (<= k)
Table 2: π2(F) = 0101 0011 vs. 0100 1000 -> differ in 4 bit positions (> k)
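The two distance checks can be reproduced directly with an XOR and a popcount:

```python
# Verify the example's two candidate checks (f = 8, k = 3).
pi1_F, cand1 = 0b11010100, 0b11000101   # table 1: permuted F vs. its candidate
pi2_F, cand2 = 0b01010011, 0b01001000   # table 2: permuted F vs. its candidate
print(bin(pi1_F ^ cand1).count("1"))    # 2 -> within k: near-duplicate
print(bin(pi2_F ^ cand2).count("1"))    # 4 -> exceeds k: not a near-duplicate
```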


Algorithm: Compression of Tables

1. Store the first fingerprint of a block (1024 bytes) uncompressed
2. XOR the current fingerprint with the previous one
3. Append to the block the Huffman code for the position of the XOR's most significant 1-bit
4. Append to the block the bits after that most significant 1-bit
5. Repeat steps 2-4 until the block is full

To compare against a query fingerprint, use the last fingerprint (the key) of each block and perform interpolation search over the keys to locate and decompress the appropriate block. A sketch of the delta-encoding step follows.
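A hedged sketch of the per-block delta encoding, assuming distinct, sorted fingerprints and writing the raw bit position where the paper would emit a Huffman code:

```python
def delta_encode(sorted_fps):
    """Delta-encode one block of sorted, distinct fingerprints.

    Returns the first fingerprint uncompressed plus, for each later
    fingerprint, (msb position of the XOR, the bits after it). The paper
    Huffman-codes the msb position; here it is left as a plain int.
    """
    first, prev, codes = sorted_fps[0], sorted_fps[0], []
    for fp in sorted_fps[1:]:
        x = fp ^ prev                     # XOR with the previous fingerprint
        msb = x.bit_length() - 1          # position of the most significant 1-bit
        codes.append((msb, x & ((1 << msb) - 1)))  # bits after that 1-bit
        prev = fp
    return first, codes

def delta_decode(first, codes):
    """Invert delta_encode: rebuild the block's fingerprints."""
    fps = [first]
    for msb, rest in codes:
        fps.append(fps[-1] ^ ((1 << msb) | rest))
    return fps

# Round-trip check on a toy block:
first, codes = delta_encode([0x12, 0x15, 0x4F])
assert delta_decode(first, codes) == [0x12, 0x15, 0x4F]
```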


Algorithm: Extending to Batch Queries

Problem: we want near-duplicates for a whole batch of query fingerprints, not just one.

Solution: use the Google File System (GFS) and MapReduce.
- Create two files: file F holds the collection of fingerprints, and file Q holds the query fingerprints
- Store both files in GFS, which breaks them into chunks
- Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q: MapReduce creates one task per chunk, the chunks are processed in parallel, and each task outputs the near-duplicates it finds
- Produce a sorted file from the outputs of all tasks, removing duplicates if necessary (a toy sketch follows)
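A toy, single-process stand-in for this job: `map_chunk` and `reduce_outputs` are illustrative names, a linear scan replaces the permuted-table probe, and all GFS/MapReduce plumbing is elided.

```python
import heapq

def map_chunk(chunk, queries, k):
    """Map task: near-duplicate pairs between one chunk of F and all of Q."""
    out = []
    for q in queries:
        for fp in chunk:
            if bin(q ^ fp).count("1") <= k:   # Hamming distance at most k
                out.append((q, fp))
    return sorted(out)

def reduce_outputs(task_outputs):
    """Merge the sorted per-task outputs into one sorted, de-duplicated list."""
    merged = []
    for pair in heapq.merge(*task_outputs):
        if not merged or merged[-1] != pair:
            merged.append(pair)
    return merged

# Two "chunks" of F probed against the same queries, then merged:
queries = [0b01001101]
outs = [map_chunk([0b01011100, 0b11111111], queries, 3),
        map_chunk([0b01001000, 0b11010010], queries, 3)]
print(reduce_outputs(outs))
```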


Experiment: Parameters

- 8 billion web pages used
- k = 1 ... 10
- Pairs manually tagged as follows:
  - True positives: pairs that differ only slightly
  - False positives: radically different pairs
  - Unknown: pairs that could not be evaluated


Experiment: Results

Accuracy:
- A low k value yields many false negatives; a high k value yields many false positives
- Best value: k = 3, where 75% of near-duplicates are reported and 75% of reported cases are true positives

Running time:
- Solution to the Hamming Distance Problem: O(log(p))
- Batch query + compression: a 32 GB file with 200 tasks runs in under 100 seconds


Related Work

- Clustering related documents: detect near-duplicates to show related pages
- Data extraction: determine the schema of similar pages to obtain information
- Plagiarism: detect pages that have borrowed from each other
- Spam: detect spam before the user receives it


Tying it Back to Lecture

Similarities:
- Both stress the importance of de-duplication for saving crawler resources
- Both briefly summarize several uses for near-duplicate detection

Differences:
- Lecture focus: a breadth-first look at algorithms for near-duplicate detection
- Paper focus: an in-depth look at the simhash and Hamming Distance algorithms, including how to implement them and how effective they are


Paper Evaluation: Pros

- Thorough step-by-step explanation of the algorithm implementation
- Thorough explanation of how the conclusions were reached
- Brief description of how to improve the simhash + Hamming Distance algorithm: categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.


Paper Evaluation: Cons

- No comparison: how much more effective or faster is it than other algorithms? By how much did it improve the crawler?
- Batch queries are tied to a specific technology: the implementation requires GFS; an approach not restricted to a particular technology might be more widely applicable


Any Questions?

???