8/3/2019 07www Duplicates
1/31
Detecting Near-Duplicates for Web Crawling
Gurmeet Singh Manku (Google)
Arvind Jain (Google)
Anish Das Sarma (Stanford University)
What are Near-Duplicates?
Documents with identical content that differ in a small portion of the page, such as:
  Advertisements
  Counters
  Timestamps
Near-Duplicates: Why and How?
Why do we want to detect near-duplicates?
  Save storage
  Improve search quality
How do we determine whether a pair of documents are near-duplicates?
  Lots of past work (survey in the paper)
  Our work: detect near-duplicate web pages during the crawl
Simplified Crawl Architecture
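[Figure: simplified crawl architecture. The crawler traverses links on the Web and fetches HTML documents. Each newly crawled document (one document) is checked against the entire web index for near-duplicates, and is then either inserted into the index or trashed.]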
Near-Duplicate Detection
Why is it hard in a crawl setting?
  Scale
    Tens of billions of documents indexed
    Millions of pages crawled every day
  Need to decide quickly!
Single and Batch Modes
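[Figure: single and batch modes. The crawl pipeline is unchanged (traverse links on the Web, fetch HTML documents), but the near-duplicate check against the entire web index is run either for a single document or for a batch of documents.]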
Rest of Talk
  Simhash overview
  Formal definition of the problem
  Single and batch algorithms
  Experiments
  Conclusions
Simhash [Charikar 02]
Dimensionality-reduction technique used for near-duplicate detection
  Obtain an f-bit fingerprint for each document
  A pair of documents are near-duplicates if and only if their fingerprints are at most k bits apart
We show experimentally that f = 64, k = 3 works well.
Simhash
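[Figure: simhash computation]
Doc. -> (feature, weight) pairs -> (hash, weight) pairs:
  feature 1: hash 100110, weight w1 -> w1 -w1 -w1 w1 w1 -w1
  feature 2: hash 110000, weight w2 -> w2 w2 -w2 -w2 -w2 -w2
  feature n: hash 001001, weight wn -> -wn -wn wn -wn -wn wn
add (per bit position): 13, 108, -22, -5, -32, 55
sign: fingerprint 110001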
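
The computation in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the use of SHA-256 as the per-feature hash and the names `feature_hash`, `sign_bits`, and `simhash` are choices made for this sketch, not details given in the talk.

```python
import hashlib

def feature_hash(feature, f):
    """Hash a feature string to an f-bit integer (SHA-256 is an assumption; f <= 256)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - f)

def sign_bits(sums):
    """Turn per-bit sums into a fingerprint: positive sum -> 1, otherwise 0."""
    fingerprint = 0
    for s in sums:
        fingerprint = (fingerprint << 1) | (1 if s > 0 else 0)
    return fingerprint

def simhash(weighted_features, f=64):
    """Compute an f-bit simhash from a {feature: weight} mapping."""
    sums = [0] * f
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, f)
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1
            # A 1 bit adds the weight, a 0 bit subtracts it.
            sums[i] += weight if bit else -weight
    return sign_bits(sums)

# The sums shown in the figure yield the fingerprint shown in the figure.
print(format(sign_bits([13, 108, -22, -5, -32, 55]), "06b"))
```

How features and weights are extracted from a document is left open here; any mapping from strings to numeric weights works with this sketch.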
Problem Definition
Input:
  Set S of f-bit fingerprints (document index)
  Query fingerprint Q (new document)
Output: whether a near-duplicate exists or not
Batch mode input: a set of query fingerprints
Running example: f = 64, k = 3
$(Q_1, Q_2) \text{ near-duplicate} \iff \text{hamming-distance}(\text{simhash}(Q_1), \text{simhash}(Q_2)) \le k$
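
As a baseline, the problem definition can be written directly as a linear scan. The function names below are hypothetical, and the scan is exactly what the later slides try to avoid at scale; it is shown only to pin down the semantics.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(fp1, fp2, k=3):
    """Two fingerprints are near-duplicates if they differ in at most k bits."""
    return hamming_distance(fp1, fp2) <= k

def has_near_duplicate(S, Q, k=3):
    """Naive single-mode query: scan every fingerprint in S."""
    return any(is_near_duplicate(fp, Q, k) for fp in S)
```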
Attempt One
[Figure: from the 64-bit query Q, generate all Q' with hd(Q, Q') ≤ k = 3, and run an exact probe for each Q' against the pre-sorted fingerprints in S.]
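
A rough sketch of this first attempt, with hypothetical names, shows why it is costly: the number of exact probes per query grows as the number of ways to flip up to k of the f bits.

```python
from bisect import bisect_left
from itertools import combinations
from math import comb

def probe_count(f=64, k=3):
    """Number of fingerprints within Hamming distance k of a query."""
    return sum(comb(f, i) for i in range(k + 1))

def variants(q, f, k):
    """Yield every f-bit value that differs from q in at most k bits."""
    for i in range(k + 1):
        for positions in combinations(range(f), i):
            v = q
            for p in positions:
                v ^= 1 << p
            yield v

def query_attempt_one(sorted_s, q, f=64, k=3):
    """Exact-probe a sorted list once per variant of q."""
    for v in variants(q, f, k):
        idx = bisect_left(sorted_s, v)
        if idx < len(sorted_s) and sorted_s[idx] == v:
            return True
    return False

print(probe_count(64, 3))  # exact probes needed for one 64-bit query with k = 3
```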
Attempt Two
[Figure: precompute S', the set of all fingerprints at most k bits away from some fingerprint in S, and sort it. Each 64-bit query Q is then answered with a single exact probe into S'.]
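
The second attempt moves the cost from query time to storage. The toy sketch below uses 8-bit fingerprints and hypothetical names; `blowup_factor` estimates how many entries each stored fingerprint could expand into, ignoring overlaps between neighbourhoods.

```python
from itertools import combinations
from math import comb

def expand(fingerprints, f, k):
    """Build S': every value within k bits of some fingerprint in S."""
    expanded = set()
    for fp in fingerprints:
        for i in range(k + 1):
            for positions in combinations(range(f), i):
                v = fp
                for p in positions:
                    v ^= 1 << p
                expanded.add(v)
    return expanded

def blowup_factor(f=64, k=3):
    """Upper bound on entries in S' per fingerprint in S."""
    return sum(comb(f, i) for i in range(k + 1))

s_prime = expand([0b00000000, 0b11111111], f=8, k=2)
print(0b00000011 in s_prime)  # a single exact lookup answers the query
```

With f = 64 and k = 3 the blow-up factor is in the tens of thousands, which is impractical for an index of billions of fingerprints.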
Intuition for Our Approach
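Observation 1: Consider $2^d$ f-bit fingerprints in sorted order
  Most of the $2^d$ combinations of the d most significant bits exist
  Can quickly do an exact probe on the first d' bits
Observation 2:
  [Figure: a query Q and a fingerprint Q' with hd(Q, Q') = 3. The differing bits occupy only a few positions, so the rest of the two fingerprints is an exact match.]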
Example
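[Figure: each 64-bit fingerprint in S is split into four 16-bit blocks A, B, C, D. Four tables are kept, each with a different block (A, B, C, or D) in the leading position. The 64-bit query Q is split the same way into Q1, Q2, Q3, Q4, and each table is probed with an exact search on its leading 16 bits.]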
Example: Analysis
64 bits split into 4 pieces
4 tables with permuted fingerprints
Exact search on 16 bits
If there are $2^{34}$ (on the order of 10 billion) fingerprints
  Each probe yields $2^{34-16} = 2^{18}$ fingerprints
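
A minimal sketch of the four-table example follows. It assumes the permutation is a rotation that brings one 16-bit block to the front; the names `rotate_left`, `build_tables`, and `query` are hypothetical. With k = 3 and four blocks, at least one block of a near-duplicate must match the query exactly, so some table's 16-bit prefix search finds it.

```python
from bisect import bisect_left

F, BLOCK, MASK = 64, 16, (1 << 64) - 1

def rotate_left(x, bits):
    """Rotate a 64-bit value left so that a chosen block becomes the prefix."""
    bits %= F
    return ((x << bits) | (x >> (F - bits))) & MASK

def build_tables(fingerprints):
    """One sorted table per block, each holding rotated copies of S."""
    return [sorted(rotate_left(fp, t * BLOCK) for fp in fingerprints)
            for t in range(F // BLOCK)]

def query(tables, q, k=3):
    """Return a stored fingerprint within k bits of q, or None."""
    for t, table in enumerate(tables):
        pq = rotate_left(q, t * BLOCK)
        prefix = pq >> (F - BLOCK)
        # Exact search on the leading 16 bits via binary search.
        lo = bisect_left(table, prefix << (F - BLOCK))
        hi = bisect_left(table, (prefix + 1) << (F - BLOCK))
        for cand in table[lo:hi]:
            # Rotation preserves Hamming distance, so compare rotated values.
            if bin(cand ^ pq).count("1") <= k:
                return rotate_left(cand, F - t * BLOCK)
    return None

stored = 0x0123456789ABCDEF
tables = build_tables([stored, 0xFFFF0000FFFF0000])
near = stored ^ ((1 << 0) | (1 << 20) | (1 << 40))  # three flipped bits
print(hex(query(tables, near)))
```

The candidates returned by each prefix search still have to be checked one by one, which is where the per-probe match count in the analysis comes from.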
Analysis (contd.)
f bits split into r pieces
$\binom{r}{k}$ tables with permuted fingerprints
Exact search on $f(1 - k/r)$ bits
With $2^d$ existing fingerprints, each probe yields $2^{d - f(1 - k/r)}$ fingerprints

For f = 64, k = 3, d = 34:

r    #tables    #matches per probe
4    4          $2^{18}$
5    10         $2^{6}$
6    20         8
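
The two formulas on this slide are easy to evaluate. The helper names below are hypothetical, and the second function is the idealized expression with fractional block widths. For r = 4 it reproduces the 4 tables and $2^{18}$ matches of the running example; when r does not divide f, real blocks have integer widths, so the idealized exponent is only approximate.

```python
from math import comb

def num_tables(r, k):
    """Number of ways to choose which k of the r blocks may contain differences."""
    return comb(r, k)

def log2_matches_per_probe(f, k, d, r):
    """Idealized exponent: d minus the f(1 - k/r) bits matched exactly."""
    return d - f * (1 - k / r)

for r in (4, 5, 6):
    print(r, num_tables(r, 3))
```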
Same Idea Recursively
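[Figure: the same idea applied recursively. Each 64-bit fingerprint in S is split into four 16-bit blocks A, B, C, D. After fixing one 16-bit block (C in the figure), the remaining 48 bits are split into four 12-bit blocks, giving sub-tables C1 to C4. Each sub-table is keyed on the 16-bit block plus one 12-bit block, leaving 36 bits.]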
General Solution
Space (#tables) / time (#matches) tradeoff
Minimum number of tables, with at most $2^x$ matches per probe?
General solution:
$$X(f, k, d) = \begin{cases} 1, & \text{if } d < x \\ \min_{r > k} \binom{r}{k} \cdot X\!\left(\frac{fk}{r},\; k,\; d - \frac{(r-k)f}{r}\right), & \text{otherwise} \end{cases}$$
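
The recurrence can be evaluated with a short memoized function. This sketch makes two assumptions that are not on the slide: block widths are treated as exact fractions, and r is capped at f so that each block has at least one bit (returning infinity when no valid split remains). The name `min_tables` and the parameter `x` for the match threshold are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, inf

def min_tables(f, k, d, x):
    """Minimum number of tables so that each probe yields fewer than 2**x matches."""
    @lru_cache(maxsize=None)
    def X(f, d):
        if d < x:
            return 1
        best = inf
        # Assumption: at least one bit per block, so r ranges up to floor(f).
        for r in range(k + 1, int(f) + 1):
            sub = X(f * k / r, d - f * (r - k) / r)
            best = min(best, comb(r, k) * sub)
        return best
    return X(Fraction(f), Fraction(d))

print(min_tables(64, 3, 34, 19))
```

Lowering the threshold forces either more blocks at the top level or a second level of tables, which is the space/time tradeoff described above.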
Compression of Tables
We can efficiently compress the tables
  In expectation, the first d bits are common in successive fingerprints
  Exploit this to compress each of the tables
  Details in the paper
Brings down space requirements by nearly 50%
Rest of Talk
  Simhash overview
  Formal definition of the problem
  Single algorithm
  Batch algorithm
  Experiments
  Conclusions
Reminder: Batch Problem
Tens of billions of pages indexed
Millions of pages crawled each day
Quickly find all new pages that have a near-duplicate in the index
MapReduce Framework
The MapReduce framework, used within Google, is massively parallel
  Map phase: operate individually on a set of objects
  Reduce phase: aggregate the results of the mapped objects
Batch Algorithm
Suppose 8B existing fingerprints (~32GB after compression): File F
1M batch query fingerprints (~8MB): File B
F stored in a GFS file system
  Chunked into roughly 64MB pieces
  Replicated at 3 random nodes
B stored with a much higher replication factor
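
The sizes quoted here follow from simple arithmetic. The check below assumes 8 bytes per 64-bit fingerprint, decimal units, and a compression ratio of roughly one half; the resulting chunk count is an estimate, not a figure from the talk.

```python
BYTES_PER_FINGERPRINT = 8          # 64-bit fingerprints
existing = 8 * 10**9               # 8B fingerprints in file F
batch = 10**6                      # 1M fingerprints in file B

raw_f_bytes = existing * BYTES_PER_FINGERPRINT   # uncompressed size of F
compressed_f_bytes = raw_f_bytes // 2            # assuming about 50% compression
batch_bytes = batch * BYTES_PER_FINGERPRINT      # size of B

chunk_bytes = 64 * 10**6                         # roughly 64MB per chunk
num_chunks = compressed_f_bytes // chunk_bytes   # estimated number of chunks of F

print(raw_f_bytes, compressed_f_bytes, batch_bytes, num_chunks)
```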
Batch Algorithm (continued)
Map phase:
  Duplicate detection between each chunk F_i and the whole of B
  Build multiple tables for B (in memory)
  Scan F_i and probe into B
  Output near-duplicates in B
Reduce phase:
  Merge outputs
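
The two phases can be simulated in a single process. This is not the MapReduce API: `map_phase`, `reduce_phase`, and the other names are hypothetical, the chunks are plain Python lists, and the tables for B reuse the four-block rotation layout of the earlier example as an assumption.

```python
from bisect import bisect_left

F, BLOCK, MASK = 64, 16, (1 << 64) - 1

def rotate_left(x, bits):
    bits %= F
    return ((x << bits) | (x >> (F - bits))) & MASK

def build_tables(batch):
    """In-memory tables for B: one sorted, rotated copy per 16-bit block."""
    return [sorted(rotate_left(fp, t * BLOCK) for fp in batch)
            for t in range(F // BLOCK)]

def probe(tables, fp, k):
    """Return the batch fingerprints within k bits of an index fingerprint."""
    hits = set()
    for t, table in enumerate(tables):
        p = rotate_left(fp, t * BLOCK)
        prefix = p >> (F - BLOCK)
        lo = bisect_left(table, prefix << (F - BLOCK))
        hi = bisect_left(table, (prefix + 1) << (F - BLOCK))
        for cand in table[lo:hi]:
            if bin(cand ^ p).count("1") <= k:
                hits.add(rotate_left(cand, F - t * BLOCK))
    return hits

def map_phase(chunk, batch, k=3):
    """One mapper: build tables for B, scan chunk F_i, probe into B."""
    tables = build_tables(batch)
    found = set()
    for fp in chunk:
        found |= probe(tables, fp, k)
    return found

def reduce_phase(outputs):
    """Merge the near-duplicates reported by all mappers."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged

def batch_near_duplicates(chunks, batch, k=3):
    return reduce_phase(map_phase(chunk, batch, k) for chunk in chunks)

chunks = [[0x1111222233334444], [0xAAAABBBBCCCCDDDD]]
batch = [0x1111222233334445,
         0xAAAABBBBCCCCDDDD ^ 0x0001000100010000,
         0x0F0F0F0F0F0F0F0F]
print(sorted(hex(fp) for fp in batch_near_duplicates(chunks, batch)))
```

Building the tables over the small file B rather than over F keeps each mapper's memory footprint at the size of the batch, while the large file is only scanned sequentially.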
Batch Algorithm (continued)
[Figure: each chunk F1, F2, ..., Fn is paired with B1 and B2, and the per-chunk outputs are merged.]
Experimental Analysis
Promising preliminary results!
Studied:
  Choice of simhash parameters
  Distribution of fingerprints
Choice of Simhash Parameters
Distribution of Fingerprints
Distribution of Fingerprints
Summary
Addressed near-duplicate detection in a web-crawling system
Proposed algorithms for the single and batch cases
Preliminary experiments to validate our techniques and the suitability of simhash
Mini-survey of near-duplicate detection in the paper
Thank you!