07www Duplicates

8/3/2019 07www Duplicates


Detecting Near-Duplicates for Web Crawling

Gurmeet Singh Manku (Google)
Arvind Jain (Google)
Anish Das Sarma (Stanford University)



What are Near-Duplicates?

Documents with identical content that differ only in a small portion of the document, such as:
  Advertisements
  Counters
  Timestamps



Near-Duplicates: Why and How?

Why do we want to detect near-duplicates?
  Save storage
  Improve search quality

How do we determine whether a pair of documents are near-duplicates?
  Lots of past work (survey in the paper)
  Our work: detect near-duplicate web pages during the crawl



Simplified Crawl Architecture

[Figure: crawl pipeline. Labels: Web, HTML Document, "Near-duplicate?" (one document checked against the entire index), insert / trash, Web Index, traverse links, newly-crawled document(s).]



Near-Duplicate Detection

Why is it hard in a crawl setting?
  Scale
    Tens of billions of documents indexed
    Millions of pages crawled every day
  Need to decide quickly!



Single and Batch Modes

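[Figure: the crawl pipeline from "Simplified Crawl Architecture" (Web, HTML Document, traverse links, Web Index). The "Near-duplicate?" check compares either a single document OR a batch of documents against the entire index.]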



Rest of Talk

Simhash overview
Formal definition of the problem
Single and Batch algorithms
Experiments
Conclusions



Simhash [Charikar 02]

Dimensionality-reduction technique used for near-duplicate detection
Obtain an f-bit fingerprint for each document
A pair of documents are near-duplicates if and only if their fingerprints are at most k bits apart
We show experimentally that f = 64, k = 3 works well



Simhash

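Doc. -> (feature, weight) pairs with weights w1, w2, ..., wn -> (hash, weight) pairs

weight   hash     per-bit contribution (+w for a 1 bit, -w for a 0 bit)
w1       100110    w1  -w1  -w1   w1   w1  -w1
w2       110000    w2   w2  -w2  -w2  -w2  -w2
...
wn       001001   -wn  -wn   wn  -wn  -wn   wn

add:          13, 108, -22, -5, -32, 55
sign:          1    1    0   0    0   1
fingerprint:  110001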
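The pipeline on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are made up for the example, and MD5 truncated to f bits is only a stand-in for whatever feature hash a real system would use.

```python
import hashlib


def hash_feature(feature: str, f: int = 64) -> int:
    """Hash a feature to an f-bit integer (MD5 is an assumed stand-in, f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - f)


def simhash(weighted_features, f=64, hash_fn=hash_feature):
    """Compute an f-bit simhash from a {feature: weight} mapping."""
    totals = [0.0] * f
    for feature, weight in weighted_features.items():
        h = hash_fn(feature, f)
        for i in range(f):
            # Bit i counted from the most significant end, as on the slide.
            bit = (h >> (f - 1 - i)) & 1
            totals[i] += weight if bit else -weight
    fingerprint = 0
    for total in totals:
        # The sign of each column sum gives one fingerprint bit.
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return fingerprint
```

Passing a custom `hash_fn` makes it easy to replay small hand-worked cases, such as three features with the 6-bit hashes shown above.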



Problem Definition

Input:
  Set S of f-bit fingerprints (document index)
  Query fingerprint Q (new document)
Output: whether a near-duplicate exists or not

Batch Mode Input: set of query fingerprints
Running Example: f = 64, k = 3

(Q1, Q2) near-duplicate  ⇔  hamming-distance(simhash(Q1), simhash(Q2)) ≤ k
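The near-duplicate test itself is a one-liner once fingerprints are integers. A minimal sketch, with illustrative function names and the running example's k = 3 as the default:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp1: int, fp2: int, k: int = 3) -> bool:
    """True when the two fingerprints are at most k bits apart."""
    return hamming_distance(fp1, fp2) <= k
```

The hard part addressed in the rest of the talk is avoiding this pairwise check against every fingerprint in S.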



Attempt One

Pre-sorted fingerprints in S
64-bit Q
Generate all Q' with hd(Q, Q') ≤ k = 3
Exact probes of every Q' into S
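To see why this first attempt is expensive, here is a small sketch that enumerates every fingerprint within k bits of Q and probes a sorted list with binary search. The function names are illustrative and an in-memory list stands in for the pre-sorted table.

```python
from bisect import bisect_left
from itertools import combinations


def neighbors(q: int, f: int, k: int):
    """Yield every f-bit value within Hamming distance k of q (including q)."""
    for flips in range(k + 1):
        for positions in combinations(range(f), flips):
            candidate = q
            for p in positions:
                candidate ^= 1 << p
            yield candidate


def has_near_duplicate(sorted_fps, q: int, f: int = 64, k: int = 3) -> bool:
    """Exact-probe the sorted fingerprints with every neighbor of q."""
    for candidate in neighbors(q, f, k):
        i = bisect_left(sorted_fps, candidate)
        if i < len(sorted_fps) and sorted_fps[i] == candidate:
            return True
    return False
```

For f = 64 and k = 3 the generator produces 1 + 64 + 2016 + 41664 = 43745 candidates, so each query costs that many exact probes.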



Attempt Two

Fingerprints in S
S': all fingerprints at most k bits away from some fingerprint in S (sorted)
64-bit Q
Exact probe of Q into S'



Intuition for Our Approach

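Observation 1: consider $2^d$ f-bit fingerprints in sorted order
  Most of the $2^d$ combinations of the d most significant bits exist
  Can quickly do an exact probe on the first d' bits (d' close to d)

Observation 2: if hd(Q, Q') = 3, then Q and Q' still match exactly on the parts of the fingerprint where no bit differs
  [Figure: Q and Q' drawn as aligned bit strings with hd(Q, Q') = 3; one aligned segment is marked "exact match!"]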



Example

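[Figure: each 64-bit fingerprint in S is split into four 16-bit blocks A, B, C, D, and four tables are shown, led by A, B, C and D respectively. The 64-bit query Q is split the same way into Q1, Q2, Q3, Q4, and each table is probed with an exact search on 16 bits.]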



Example: Analysis

64 bits split into 4 pieces
4 tables with permuted fingerprints
Exact search on 16 bits
If there are $2^{34}$ (on the order of 10 billion) fingerprints, each probe gives $2^{34-16} = 2^{18}$ fingerprints
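The four-table example can be sketched as follows. This is an illustration under stated assumptions rather than the system described in the talk: bit rotation is used as the permutation that brings each 16-bit block to the front, in-memory sorted lists with `bisect` stand in for the real tables, and the names `build_tables` and `query` are made up. With k = 3 and 4 blocks, any fingerprint within 3 bits of Q must agree with Q on at least one whole block, so probing all four tables cannot miss it.

```python
from bisect import bisect_left

F = 64                      # fingerprint width in bits
BLOCKS = 4                  # number of pieces
BLOCK_BITS = F // BLOCKS    # 16 bits per piece
MASK = (1 << F) - 1


def rotate_left(x: int, bits: int) -> int:
    """Rotate an F-bit value left by the given number of bits."""
    bits %= F
    return ((x << bits) | (x >> (F - bits))) & MASK


def build_tables(fingerprints):
    """One sorted table per block; table i has block i rotated to the top."""
    return [
        sorted(rotate_left(fp, i * BLOCK_BITS) for fp in fingerprints)
        for i in range(BLOCKS)
    ]


def query(tables, q: int, k: int = 3):
    """Return all stored fingerprints within Hamming distance k of q."""
    matches = set()
    shift = F - BLOCK_BITS
    for i, table in enumerate(tables):
        pq = rotate_left(q, i * BLOCK_BITS)
        prefix = pq >> shift
        # Exact search on the leading 16 bits: a contiguous range in the table.
        lo = bisect_left(table, prefix << shift)
        hi = bisect_left(table, (prefix + 1) << shift)
        for cand in table[lo:hi]:
            # Rotation preserves Hamming distance, so compare permuted values.
            if bin(cand ^ pq).count("1") <= k:
                matches.add(rotate_left(cand, F - i * BLOCK_BITS))
    return matches
```

Only the candidates that share the leading 16 bits are compared bit by bit, which is what keeps each probe small.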



Analysis (contd.)

f bits split into r pieces
$\binom{r}{k}$ tables with permuted fingerprints
Exact search on $f(1 - k/r)$ bits
With $2^d$ existing fingerprints, each probe yields $2^{d - f(1 - k/r)}$ fingerprints

For f = 64, k = 3, d = 34:

r    #tables    #matches per probe
4    4          $2^{18}$
5    10         $2^{6}$
6    20         8



Same Idea Recursively

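[Figure: fingerprints in S are split into four 16-bit blocks A, B, C, D, giving four tables led by A, B, C and D. Within one table (shown for C), the remaining 48 bits are split again into four 12-bit blocks, giving sub-tables C1, C2, C3, C4; each sub-table is labeled with a 16-bit block, a 12-bit block, and the remaining 36 bits.]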



General Solution

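Space (#tables) / Time (#matches) tradeoff
Minimum number of tables, with at most $2^{\tau}$ matches per probe?
General solution:

$$
X(f, k, d) =
\begin{cases}
1 & \text{if } d < \tau \\
\min_{r > k} \binom{r}{k} \cdot X\!\left(\frac{fk}{r},\; k,\; d - \frac{(r-k)f}{r}\right) & \text{otherwise}
\end{cases}
$$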
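The recurrence can be explored numerically with a short recursive function. This is a rough sketch under several assumptions: `tau` names the threshold exponent on matches per probe, the d argument shrinks by the (r - k)f/r bits used for the exact search (matching the f(1 - k/r) search width from the analysis slide), fractional bit counts are allowed, and the search over r is capped at a hypothetical `r_max`.

```python
from math import comb, inf


def min_tables(f: float, k: int, d: float, tau: float, r_max: int = 8):
    """Minimum number of tables so that each probe yields fewer than 2**tau matches."""
    if d < tau:
        return 1            # probes are already small enough: one table suffices
    if f <= k:
        return inf          # too few bits left to split any further
    best = inf
    for r in range(k + 1, r_max + 1):
        # Split into r pieces, search exactly on r - k of them, recurse on the rest.
        sub = min_tables(f * k / r, k, d - (r - k) * f / r, tau, r_max)
        best = min(best, comb(r, k) * sub)
    return best
```

Under these assumptions, `min_tables(64, 3, 34, 19)` returns 4 and `min_tables(64, 3, 34, 10)` returns 10, which shows the space/time tradeoff: tightening the bound on matches per probe costs more tables.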



Compression of Tables

We can efficiently compress tables
In expectation, the first d bits are common in successive fingerprints
Exploit this to compress each of the tables
Details in the paper
Brings down space requirements by nearly 50%
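One simple way to exploit shared leading bits is to store, for each fingerprint in a sorted table, only the position of the first bit that differs from its predecessor plus the bits below it. The sketch below illustrates that idea only; it is not the scheme from the paper (which the slide defers to), it applies no entropy coding, and it assumes the input is sorted with no repeated values. The function names are illustrative.

```python
def compress(sorted_fps):
    """Encode sorted, distinct fingerprints as (first, [(h, low_bits), ...])."""
    if not sorted_fps:
        return None, []
    first = sorted_fps[0]
    deltas = []
    prev = first
    for cur in sorted_fps[1:]:
        # h is the 1-based position of the most significant differing bit.
        h = (prev ^ cur).bit_length()
        # Bits above position h are shared with prev; bit h itself must be 1
        # in cur (because cur > prev), so only the h - 1 lower bits are kept.
        deltas.append((h, cur & ((1 << (h - 1)) - 1)))
        prev = cur
    return first, deltas


def decompress(first, deltas):
    """Invert compress()."""
    if first is None:
        return []
    out = [first]
    prev = first
    for h, low in deltas:
        cur = ((prev >> h) << h) | (1 << (h - 1)) | low
        out.append(cur)
        prev = cur
    return out
```

When neighbors share many leading bits, `h` is small and each entry needs far fewer than f bits.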



Rest of Talk

Simhash overview
Formal definition of the problem
Single algorithm
Batch algorithm
Experiments
Conclusions



Reminder: Batch Problem

Tens of billions of pages indexed
Crawl millions of pages each day

Quickly find all new pages having a near-duplicate in the index



MapReduce Framework

MapReduce framework used within Google
  Massively parallel
Map phase: operate individually on a set of objects
Reduce phase: aggregate results of the mapped objects



Batch Algorithm

Suppose 8B existing fingerprints (~32GB after compression): File F
1M batch query fingerprints (~8MB): File B
F stored in a GFS file system
  Chunked into pieces of roughly 64MB
  Replicated at 3 random nodes
B stored with a much higher replication factor



Batch Algorithm (continued)

Map phase:
  Duplicate detection between each chunk F_i and the whole of B
  Build multiple tables for B (in memory)
  Scan F_i and probe into B
  Output near-duplicates in B
Reduce phase:
  Merge outputs
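The two phases can be mimicked on a single machine to make the data flow concrete. In this sketch plain Python loops stand in for MapReduce, a dictionary keyed by (block index, block value) stands in for the multiple in-memory tables built over B, and every name is illustrative. It assumes 64-bit fingerprints split into 4 blocks with k = 3, as in the running example.

```python
from collections import defaultdict

F_BITS, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F_BITS // BLOCKS
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def blocks(fp: int):
    """Split a fingerprint into (block index, block value) keys."""
    return [(i, (fp >> (i * BLOCK_BITS)) & BLOCK_MASK) for i in range(BLOCKS)]


def build_batch_tables(batch):
    """Index every query fingerprint in B under each of its blocks."""
    tables = defaultdict(list)
    for q in batch:
        for key in blocks(q):
            tables[key].append(q)
    return tables


def map_chunk(chunk, tables):
    """Map phase for one chunk F_i: scan it and probe into the tables for B."""
    found = set()
    for fp in chunk:
        for key in blocks(fp):
            for q in tables.get(key, ()):
                if bin(fp ^ q).count("1") <= K:
                    found.add(q)
    return found


def reduce_outputs(outputs):
    """Reduce phase: merge the per-chunk outputs."""
    merged = set()
    for output in outputs:
        merged |= output
    return merged


def batch_near_duplicates(chunks, batch):
    """Query fingerprints in B that have a near-duplicate in any chunk of F."""
    tables = build_batch_tables(batch)
    return reduce_outputs(map_chunk(chunk, tables) for chunk in chunks)
```

Indexing the small file B and streaming the large file F past it is the reverse of the single-document case, where the index over S is built once and each query probes it.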



Batch Algorithm (continued)

[Figure: chunks F1, F2, ..., Fn are each shown alongside B1 and B2; the outputs from all chunks feed into a single "merge" step.]



Experimental Analysis

Promising preliminary results!
Studied:
  Choice of simhash parameters
  Distribution of fingerprints



Choice of Simhash Parameters



Distribution of Fingerprints



Distribution of Fingerprints



Summary

Addressed near-duplicate detection in a web-crawling system

Proposed algorithms for single and batch cases
Preliminary experiments to validate our techniques and the suitability of simhash
Mini-survey of near-duplicate detection in the paper



Thank you!
