Detecting Near Duplicates for Web Crawling
Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
Presented by Chintan Udeshi
6/28/2011 Udeshi-CS572
Introduction
There are many duplicate and near-duplicate documents on the web.
Many pages differ only in a small portion, for example because of the advertisements displayed.
Such pages are irrelevant from a crawling point of view.
This paper uses Charikar's fingerprinting technique (simhash) to find near-duplicate documents.
The technique is useful for both online queries and batch queries.
Advantages of duplicate detection
Saves bandwidth.
Reduces storage cost.
Improves the quality of search engine results.
Reduces load on remote hosts.
Challenges of duplicate detection
Scale
Speed
Low resource usage
FINGERPRINTING WITH SIMHASH
Extract a set of features from a document, along with a corresponding weight for each feature.
We use simhash to generate an f-bit fingerprint from the weighted features of a given document.
With simhash, 64-bit fingerprints are good enough for 8B web pages.
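As a minimal sketch of the fingerprinting step, the following Python computes an f-bit simhash from weighted features. The MD5-based `feature_hash`, the whitespace tokenizer in `word_features`, and the use of term counts as weights are illustrative assumptions, not choices stated on the slide.

```python
import hashlib
from collections import Counter


def feature_hash(feature: str, f: int = 64) -> int:
    """Hash a feature to an f-bit integer (MD5 is an assumed choice, f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(weighted_features: dict, f: int = 64) -> int:
    """Compute an f-bit simhash fingerprint from {feature: weight}."""
    v = [0.0] * f
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            # A 1-bit in the feature hash adds the weight, a 0-bit subtracts it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(f):
        # The sign of each component decides the corresponding fingerprint bit.
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def word_features(text: str) -> dict:
    """Hypothetical feature extractor: lowercase words weighted by their counts."""
    return dict(Counter(text.lower().split()))


fingerprint = simhash(word_features("detecting near duplicates for web crawling"))
print(f"{fingerprint:016x}")
```

Any feature extractor that yields weighted features could replace `word_features` here; the fingerprinting step itself only depends on the feature hashes and weights.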
Idea behind using the simhash algorithm
Simhash has 2 properties:
A: The fingerprint of a document is a hash of its features.
B: Similar documents have similar hash values.
The algorithms are designed assuming that Property A holds, and the impact of the non-uniformity introduced by Property B is measured experimentally on real datasets.
Hamming distance problem
Consider a collection of 8B 64-bit fingerprints, occupying 64GB.
Given a query fingerprint F, we have to decide whether any of the existing 8B 64-bit fingerprints differs from F in at most k = 3 bit-positions.
The algorithm is different for online queries and batch queries.
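To make the problem statement concrete, here is a small Python sketch of the Hamming distance test and of the storage arithmetic. The linear scan in `has_near_duplicate` is only a naive baseline for illustration, not the algorithm the paper uses at scale.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit-positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def has_near_duplicate(query: int, fingerprints, k: int = 3) -> bool:
    """Naive baseline: scan every stored fingerprint (too slow for 8B entries)."""
    return any(hamming_distance(query, fp) <= k for fp in fingerprints)


# Storage arithmetic from the slide: 8B fingerprints of 64 bits (8 bytes) each.
num_fingerprints = 8 * 10**9
bytes_per_fingerprint = 64 // 8
total_gb = num_fingerprints * bytes_per_fingerprint / 10**9
print(total_gb)  # 64.0
```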
Algorithm for online queries
We have to build t tables: T1, T2, ..., Tt.
Table Ti is constructed by applying a permutation to each existing fingerprint.
Each table is then probed in 2 steps:
Identify all permuted fingerprints in Ti whose top Pi bit-positions match the top Pi bit-positions of the permuted query fingerprint.
For each fingerprint identified in the first step, check whether it differs from the permuted query fingerprint in at most k bit-positions.
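The two steps can be sketched in Python for one simple configuration, chosen here as an assumption: f = 64, four tables, each permutation a rotation that brings one 16-bit block to the top, and Pi = 16 for every table. With k at most 3, any two fingerprints within distance k agree exactly on at least one of the four blocks, so one of the tables will surface the match. The names `build_tables` and `query` are hypothetical.

```python
from bisect import bisect_left

F = 64                     # fingerprint width in bits
BLOCKS = 4                 # number of tables (assumed configuration)
BLOCK_BITS = F // BLOCKS   # top bits compared per table (Pi = 16 here)
MASK = (1 << F) - 1


def rotate_left(x: int, r: int) -> int:
    """Rotate an F-bit integer left by r bits (the permutation used per table)."""
    r %= F
    return ((x << r) | (x >> (F - r))) & MASK


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def build_tables(fingerprints):
    """Table i holds every fingerprint rotated so that block i is on top, sorted."""
    return [sorted(rotate_left(fp, i * BLOCK_BITS) for fp in fingerprints)
            for i in range(BLOCKS)]


def query(tables, fp: int, k: int = 3):
    """Return stored fingerprints that differ from fp in at most k bit-positions."""
    if k >= BLOCKS:
        raise ValueError("this sketch only guarantees matches for k < BLOCKS")
    matches = set()
    for i, table in enumerate(tables):
        r = i * BLOCK_BITS
        q = rotate_left(fp, r)
        prefix = q >> (F - BLOCK_BITS)
        # Step 1: binary search for the run of entries sharing the top bits.
        lo = bisect_left(table, prefix << (F - BLOCK_BITS))
        hi = bisect_left(table, (prefix + 1) << (F - BLOCK_BITS))
        # Step 2: full Hamming distance check on each candidate.
        for candidate in table[lo:hi]:
            if hamming(candidate, q) <= k:
                matches.add(rotate_left(candidate, F - r))  # undo the rotation
    return matches


stored = [0x0123456789ABCDEF, 0xFFFF000000000000]
tables = build_tables(stored)
print(query(tables, stored[0] ^ 0b111))
```

Other choices of permutations and Pi values trade storage against the number of candidates checked in the second step.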
Design parameters for the algorithm
There is a trade-off between the number of tables and the value of Pi chosen for each table.
Increasing the number of tables increases Pi and hence reduces the query time.
Decreasing the number of tables reduces storage requirements, but reduces Pi and thus increases the query time.
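The effect of Pi on query time can be estimated with simple arithmetic. Assuming fingerprints are roughly uniformly distributed (a simplification), matching the top Pi bits leaves about n / 2^Pi candidates to check per table. The function name and the sample Pi values below are illustrative and not taken from the slides.

```python
def expected_candidates(num_fingerprints: int, p: int) -> float:
    """Expected number of stored fingerprints sharing a given p-bit prefix,
    under the assumption of uniformly distributed fingerprints."""
    return num_fingerprints / 2**p


n = 8 * 10**9  # collection size used elsewhere in this deck
for p in (16, 25, 31):
    # Larger Pi means fewer candidates for the Hamming distance check.
    print(p, round(expected_candidates(n, p), 1))
```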
Algorithm for batch queries
Files are first broken into 64MB chunks.
Each chunk is replicated at three randomly chosen machines in a cluster.
Each chunk is stored as a file in the local file system.
First, we solve the Hamming distance problem for each 64MB chunk.
Then we combine the output from all the chunks to produce the final output.
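The two phases can be illustrated with a single-process Python sketch. In a real cluster each chunk would be handled by a separate task on a machine holding a replica; here the chunks are just list slices, and `chunk_size`, `solve_chunk`, and `batch_near_duplicates` are hypothetical names.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def split_into_chunks(fingerprints, chunk_size):
    """Stand-in for breaking the fingerprint file into fixed-size chunks."""
    return [fingerprints[i:i + chunk_size]
            for i in range(0, len(fingerprints), chunk_size)]


def solve_chunk(chunk, queries, k=3):
    """Phase 1: solve the Hamming distance problem for one chunk."""
    return [(q, fp) for q in queries for fp in chunk if hamming(q, fp) <= k]


def batch_near_duplicates(existing, queries, chunk_size=2, k=3):
    """Phase 2: merge per-chunk outputs, drop duplicates, and sort."""
    results = set()
    for chunk in split_into_chunks(existing, chunk_size):
        results.update(solve_chunk(chunk, queries, k))
    return sorted(results)


print(batch_near_duplicates([0x00, 0xFF, 0xFFFF], [0x01, 0xFE]))
```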
Broder's shingle-based fingerprints
Broder's shingle-based fingerprints use Rabin fingerprints.
Given an n-bit message m0, ..., mn-1, viewed as a polynomial f(x) over GF(2), the fingerprint of m is defined to be the remainder r(x) after division of f(x) by an irreducible polynomial p(x).
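A minimal Python sketch of this remainder computation follows. It represents polynomials over GF(2) as integers, where bit i is the coefficient of x^i. The byte and bit ordering in `rabin_fingerprint` and the small degree-8 polynomial `P` are assumptions chosen for illustration; practical Rabin fingerprints use a randomly chosen irreducible polynomial of much higher degree.

```python
def gf2_mod(f: int, p: int) -> int:
    """Remainder of f(x) divided by p(x), with coefficients in GF(2)."""
    if p == 0:
        raise ValueError("p(x) must be non-zero")
    deg_p = p.bit_length() - 1
    while f.bit_length() - 1 >= deg_p:
        # Cancel the leading term of f with a shifted copy of p (XOR = subtraction).
        shift = (f.bit_length() - 1) - deg_p
        f ^= p << shift
    return f


# x^8 + x^4 + x^3 + x + 1, a well-known irreducible polynomial over GF(2).
P = 0x11B


def rabin_fingerprint(message: bytes, p: int = P) -> int:
    """Fingerprint of a message: f(x) mod p(x), with m0 as the lowest bit
    of the first byte (an assumed ordering)."""
    f = int.from_bytes(message, "little")
    return gf2_mod(f, p)


print(hex(rabin_fingerprint(b"near duplicates")))
```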
Comparison with Broder's shingle-based fingerprints
For the comparison, 6 Rabin fingerprints are calculated per document.
Two documents are then checked to see whether 2 or more of their fingerprints match.
Each shingle-based fingerprint takes approximately 24 bytes.
In contrast, simhash needs only a 64-bit fingerprint per page for 8B web pages.
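The storage difference can be worked out directly from the figures on this slide. The Python sketch below also includes a hypothetical `shingle_near_duplicate` check that counts matching fingerprints position by position; that pairing rule is an assumption made for illustration.

```python
NUM_PAGES = 8 * 10**9  # 8B web pages


def storage_gb(bytes_per_page: int, num_pages: int = NUM_PAGES) -> float:
    """Total fingerprint storage in GB (10^9 bytes)."""
    return bytes_per_page * num_pages / 10**9


def shingle_near_duplicate(fps_a, fps_b, min_matches: int = 2) -> bool:
    """Assumed rule: near-duplicates if at least min_matches of the six
    fingerprints agree at the same position."""
    matches = sum(1 for a, b in zip(fps_a, fps_b) if a == b)
    return matches >= min_matches


print(storage_gb(24))  # shingle-based: 192.0 GB
print(storage_gb(8))   # simhash, 64 bits per page: 64.0 GB
```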
Experimental results
There is a trade-off between f and k when detecting near-duplicate web pages using simhash.
Topics include:
Choice of parameters
Distribution of fingerprints
Scalability
Choice of parameters
Vary k from 1 to 10.
Divide pages into different categories:
False positive
True positive
Unknown
There is a trade-off. k = 3 gives reasonable results for 64-bit fingerprints.
Distribution of fingerprints (1)
The left side of the distribution shown on the slide does not drop off as rapidly as the right side.
This is because some pages are similar to each other, so their fingerprints differ by only a moderate number of bits.
Distribution of fingerprints (2)
The distribution is more or less uniform, with spikes in some places.
Reasons for the spikes:
Empty pages.
"File not found" pages.
Multiple websites using a similar login page.
Nature of corpus
Near-duplicate detection systems mainly target 4 types of document collections:
Web pages
Files in a file system
E-mail
Domain-specific corpora
This paper mainly deals with finding near-duplicates among web pages.
Scalability
In batch mode, a compressed version of file Q occupies almost 32GB.
Data is typically scanned at a rate of approximately 1GBps.
So the computation usually finishes in under 100 seconds.
Need to detect duplicates
Web mirrors
Clustering for "related documents" queries
Data extraction
Plagiarism
Spam detection
Duplicates in domain-specific corpora
Feature sets per document
Shingles from page content
Document vector from page content
Connectivity information
Anchor text and anchor window
Phrases
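As an illustration of the first feature type, the following Python sketch extracts word-level shingles from page content. The shingle length `k = 3`, the lowercasing, and the whitespace tokenization are assumptions made for the example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (contiguous word sequences) in text."""
    words = text.lower().split()
    # Texts shorter than k words yield an empty set.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


print(sorted(shingles("Detecting near duplicates for web crawling")))
```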
Future research
Can we categorize web pages into categories and search for near-duplicates only within the relevant categories?
Is it feasible to devise algorithms for detecting portions of web pages that contain ads or timestamps?
How sensitive is the simhash algorithm to changes in feature selection and in the assignment of weights to features?
Algorithms for clustering documents.
Can we categorize documents based on language?
Thank you.
Q & A?