Detecting Near-Duplicates for Web Crawling. Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Presented by Chintan Udeshi, 6/28/2011 (Udeshi-CS572).


Page 1: Detecting Near Duplicates for Web Crawling

Detecting Near Duplicates for Web Crawling

Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma

Presented by Chintan Udeshi

Page 2: Detecting Near Duplicates for Web Crawling

Introduction

There are many near-duplicate documents on the web: pages that differ only in a small portion, for example because of the advertisements displayed. Such pages are irrelevant from a crawling point of view. This paper uses Charikar's simhash fingerprinting technique to find near-duplicate documents. The technique is useful for both online queries and batch queries.

Page 3: Detecting Near Duplicates for Web Crawling

Advantages of duplicate detection

- Saves crawl bandwidth
- Reduces storage cost
- Improves search-engine quality
- Reduces load on remote hosts

Page 4: Detecting Near Duplicates for Web Crawling

Challenges for duplicate detection

- Scaling to very large collections
- Speed
- Using fewer resources

Page 5: Detecting Near Duplicates for Web Crawling

Fingerprinting with Simhash

Extract a set of features from a document, along with a corresponding weight for each feature. Simhash turns this weighted feature set into an f-bit fingerprint, based on the presence or absence of features in the document. In practice, a 64-bit fingerprint is good enough for 8B web pages.
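The construction above can be sketched as follows (a minimal illustration; MD5 is used here as the underlying feature hash purely for convenience, the paper does not prescribe a specific hash function):

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash fingerprint from a weighted feature set.

    features: dict mapping each feature string to its weight.
    """
    v = [0] * f  # one running counter per bit position
    for feat, weight in features.items():
        # Hash the feature to an f-bit integer (MD5 is illustrative).
        h = int.from_bytes(hashlib.md5(feat.encode()).digest()[: f // 8], "big")
        for i in range(f):
            # Add the weight if bit i of the hash is set, else subtract it.
            v[i] += weight if (h >> i) & 1 else -weight
    # Bit i of the fingerprint is 1 iff counter i ends up positive.
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Documents with mostly overlapping feature sets produce fingerprints that differ in only a few bit positions, which is what the Hamming-distance search on the following slides exploits.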

Page 6: Detecting Near Duplicates for Web Crawling

Idea behind the Simhash algorithm

Simhash has two properties:

- Property A: the fingerprint of a document is a hash of its features.
- Property B: similar documents have similar hash values.

The algorithms are designed assuming that Property A holds, and the impact of the non-uniformity introduced by Property B is measured experimentally on real datasets.

Page 7: Detecting Near Duplicates for Web Crawling

Hamming distance problem

Consider a collection of 8B 64-bit fingerprints, occupying 64 GB. Given a query fingerprint F, we have to decide whether any of the existing 8B fingerprints differs from F in at most k = 3 bit positions. The algorithm is different for online queries and batch queries.
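The core primitive is a Hamming-distance test between two 64-bit fingerprints; a naive linear scan looks like this (the tables on the next slide exist precisely to avoid scanning all 8B fingerprints per query):

```python
def hamming_distance(a: int, b: int) -> int:
    # Count the bit positions where the two fingerprints differ.
    return bin(a ^ b).count("1")

def has_near_duplicate(f: int, fingerprints, k: int = 3) -> bool:
    # Naive O(n) scan over the whole collection.
    return any(hamming_distance(f, g) <= k for g in fingerprints)
```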

Page 8: Detecting Near Duplicates for Web Crawling

Algorithm for online queries

Build t tables: T1, T2, ..., Tt. Table Ti is constructed by applying a permutation to each existing fingerprint and sorting the results. A query then proceeds in two steps:

1. Identify all permuted fingerprints in Ti whose top pi bit positions match the top pi bit positions of the permuted query fingerprint.
2. Check whether each such candidate differs from the permuted query fingerprint in at most k bit positions.
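A minimal sketch of the table scheme, using bit rotations as the permutations and probing on the top p bits (the rotation amounts and the value of p below are illustrative choices, not the paper's exact configuration):

```python
import bisect

F = 64  # fingerprint width in bits
MASK = (1 << F) - 1

def rotate(x: int, r: int) -> int:
    # Rotate a 64-bit fingerprint left by r bits (a simple permutation).
    return ((x << r) | (x >> (F - r))) & MASK

class Table:
    def __init__(self, fingerprints, rotation, p):
        self.rotation = rotation
        self.p = p  # number of top bits used for the probe
        self.fps = sorted(rotate(f, rotation) for f in fingerprints)

    def candidates(self, query):
        q = rotate(query, self.rotation)
        # Fingerprints sharing q's top p bits form one contiguous run
        # of the sorted table, located by binary search.
        shift = F - self.p
        lo = (q >> shift) << shift
        i = bisect.bisect_left(self.fps, lo)
        j = bisect.bisect_left(self.fps, lo + (1 << shift))
        return self.fps[i:j]

def query(tables, f, k=3):
    for t in tables:
        q = rotate(f, t.rotation)
        for cand in t.candidates(f):
            if bin(cand ^ q).count("1") <= k:  # step 2: exact check
                return True
    return False
```

Because each table stores a full permuted copy of the fingerprint collection, the number of tables directly multiplies storage, which is the trade-off discussed on the next slide.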

Page 9: Detecting Near Duplicates for Web Crawling

Design parameters for the algorithm

There is a trade-off between the number of tables and the value of pi chosen for each table. Increasing the number of tables allows a larger pi and hence reduces query time. Decreasing the number of tables reduces storage requirements, but reduces pi and thus increases query time.
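A back-of-the-envelope calculation shows why a larger pi shortens queries (assuming roughly uniformly distributed fingerprints, i.e. Property A; the values of p below are illustrative):

```python
# With N fingerprints, a probe on the top p bits matches about N / 2**p
# stored fingerprints on average, each of which needs an exact check.
N = 2 ** 33  # roughly 8B fingerprints

for p in (20, 25, 30):
    print(p, N // 2 ** p)  # expected candidates per table probe
```

At p = 25 a probe leaves about 256 candidates per table, versus 8192 at p = 20, so matching more top bits sharply cuts the exact-check work per query.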

Page 10: Detecting Near Duplicates for Web Crawling

Algorithm for batch queries

Files are first broken into 64 MB chunks. Each chunk is replicated on three randomly chosen machines in a cluster and stored as a file in the local file system. First, the Hamming distance problem is solved for each 64 MB chunk independently. Then the outputs from all the chunks are combined to produce the final output.
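The batch computation can be sketched as a map-reduce-style loop (a single-machine stand-in; in the real system each chunk is handled by a separate task on the cluster):

```python
def solve_chunk(chunk, existing_fps, k=3):
    # "Map" step: find all near-duplicate pairs for one chunk of queries.
    return [(q, g) for q in chunk for g in existing_fps
            if bin(q ^ g).count("1") <= k]

def batch_query(query_fps, existing_fps, chunk_size, k=3):
    chunks = [query_fps[i:i + chunk_size]
              for i in range(0, len(query_fps), chunk_size)]
    results = []
    for chunk in chunks:  # one task per chunk in the clustered setup
        results.extend(solve_chunk(chunk, existing_fps, k))
    return results  # "reduce" step: concatenate the per-chunk outputs
```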

Page 11: Detecting Near Duplicates for Web Crawling

Broder's shingle-based fingerprints

Broder's shingle-based fingerprinting uses Rabin fingerprints. Given an n-bit message m0, ..., mn-1, the message is interpreted as a polynomial f(x) over GF(2), and the fingerprint of m is the remainder r(x) after division of f(x) by a fixed irreducible polynomial p(x).
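The remainder computation is polynomial long division over GF(2); a bit-level sketch (a toy for clarity — real implementations process a byte or word at a time with precomputed tables; the polynomial encoding below is an assumed convention):

```python
def rabin_fingerprint(message: int, n: int, p: int, degree: int) -> int:
    """Remainder of the n-bit message polynomial modulo p over GF(2).

    p encodes the irreducible polynomial as a bitmask with its
    highest-degree term set, e.g. p = 0b1011 is x^3 + x + 1 (degree 3).
    """
    r = 0
    for i in range(n - 1, -1, -1):  # feed bits most-significant first
        r = (r << 1) | ((message >> i) & 1)
        if (r >> degree) & 1:       # remainder reached degree `degree`:
            r ^= p                  # subtract (XOR) p to reduce it
    return r
```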

Page 12: Detecting Near Duplicates for Web Crawling

Comparison with Broder's shingle-based fingerprints

For the comparison, 6 Rabin fingerprints are calculated per document, and two documents are flagged as near-duplicates if 2 or more of their fingerprints match. Each fingerprint takes approximately 24 bytes. Simhash, on the other hand, needs only a 64-bit fingerprint per document for 8B web pages.
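The storage gap is easy to quantify (illustrative arithmetic based on the figures above):

```python
DOCS = 8 * 10 ** 9                  # 8B web pages

broder_bytes = DOCS * 6 * 24        # six ~24-byte fingerprints per document
simhash_bytes = DOCS * 8            # one 64-bit (8-byte) fingerprint each

print(broder_bytes // simhash_bytes)  # per-document storage ratio
```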

Page 13: Detecting Near Duplicates for Web Crawling

Experimental Results

There is a trade-off between f and k when detecting near-duplicate web pages with simhash. Topics covered:

- Choice of parameters
- Distribution of fingerprints
- Scalability

Page 14: Detecting Near Duplicates for Web Crawling

Choice of parameters

Vary k between 1 and 10, and divide the flagged page pairs into categories: true positives, false positives, and unknown. There is a trade-off: a small k misses genuine near-duplicates, while a large k admits false positives. k = 3 gives reasonable results for 64-bit fingerprints.

Page 15: Detecting Near Duplicates for Web Crawling

Distribution of fingerprints (1)

The left side of the Hamming-distance distribution does not drop as rapidly as the right side. This is because some pages are similar to each other, so their fingerprints differ in only a moderate number of bit positions.

Page 16: Detecting Near Duplicates for Web Crawling

Distribution of fingerprints (2)

The distribution of fingerprint values is more or less uniform, with spikes in some places. Reasons for the spikes:

- Empty pages
- "File not found" pages
- Multiple websites using similar login pages

Page 17: Detecting Near Duplicates for Web Crawling

Nature of the corpus

Duplicate detection applies to four main kinds of documents:

- Web pages
- Files in a file system
- E-mail
- Domain-specific corpora

This paper mainly involves finding near-duplicates among web pages.

Page 18: Detecting Near Duplicates for Web Crawling

Scalability

For batch mode, the compressed version of file Q occupies almost 32 GB. Each file is processed at approximately 1 GBps, so the computation usually finishes in about 100 seconds.

Page 19: Detecting Near Duplicates for Web Crawling

Need to detect duplicates

- Web mirrors
- Clustering for related-document queries
- Data extraction
- Plagiarism detection
- Spam detection
- Duplicates in domain-specific corpora

Page 20: Detecting Near Duplicates for Web Crawling

Feature set per document

- Shingles from page content
- Document vector from page content
- Connectivity information
- Anchor text and anchor window
- Phrases

Page 21: Detecting Near Duplicates for Web Crawling

Future Research

- Can we categorize web pages and search for near-duplicates only within the relevant categories?
- Is it feasible to devise algorithms for detecting the portions of web pages that contain ads or timestamps?
- How sensitive is simhash to the choice of features and the weights assigned to them?
- Algorithms for clustering documents.
- Can we categorize documents based on language?

Page 22: Detecting Near Duplicates for Web Crawling

Thank you. Q & A?