07www Duplicates

  • 8/3/2019 07www Duplicates

    1/31

    Detecting Near-Duplicates for Web Crawling

    Gurmeet Singh Manku (Google)

    Arvind Jain (Google)

    Anish Das Sarma (Stanford University)

  • What are Near-Duplicates?

    Documents with identical content that differ in only a small portion of the document, for example:

      • Advertisements
      • Counters
      • Timestamps

  • Near-Duplicates: Why and How?

    Why do we want to detect near-duplicates?

      • Save storage
      • Search quality

    How do we determine whether a pair of documents are near-duplicates?

      • Lots of past work (survey in the paper)
      • Our work: detect near-duplicate web pages during the crawl

  • Simplified Crawl Architecture

    [Figure: the crawler traverses links on the Web and fetches HTML documents. Each newly crawled document, one document at a time, is checked against the entire Web index ("Near-duplicate?") and is then either inserted into the index or trashed.]

  • Near-Duplicate Detection

    Why is it hard in a crawl setting?

      • Scale
          • Tens of billions of documents indexed
          • Millions of pages crawled every day
      • Need to decide quickly!

  • Single and Batch Modes

    [Figure: the crawl architecture again. The "Near-duplicate?" check compares either a single document or a batch of documents against the entire Web index.]

  • Rest of Talk

      • Simhash overview
      • Formal definition of the problem
      • Single and Batch algorithms
      • Experiments
      • Conclusions

  • Simhash [Charikar 02]

      • Dimensionality-reduction technique used for near-duplicate detection
      • Obtain an $f$-bit fingerprint for each document
      • A pair of documents are near-duplicates if and only if their fingerprints are at most $k$ bits apart
      • We show experimentally that $f = 64$, $k = 3$ work well

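  • Simhash

    [Figure: computing a fingerprint, illustrated with 6-bit hashes.]

      • The document yields (feature, weight) pairs with weights $w_1, w_2, \ldots, w_n$
      • Hashing each feature gives (hash, weight) pairs: 100110 with $w_1$, 110000 with $w_2$, ..., 001001 with $w_n$
      • Each hash becomes a signed row, $+w_i$ for a 1 bit and $-w_i$ for a 0 bit:

           w1  -w1  -w1   w1   w1  -w1
           w2   w2  -w2  -w2  -w2  -w2
          -wn  -wn   wn  -wn  -wn   wn

      • Add the rows column by column: 13, 108, -22, -5, -32, 55
      • Take the sign of each sum (1 for positive, 0 for negative) to get the fingerprint: 110001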
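
The steps in the Simhash figure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `feature_hash` (the top bits of SHA-256) is a hypothetical stand-in for whatever feature hash a crawler would use, and the function names are made up for this example.

```python
import hashlib

def feature_hash(feature, f=64):
    """Hypothetical feature hash: the top f bits of SHA-256 (requires f <= 256)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - f)

def signed_sums(hashed, f):
    """Add +w for every 1 bit and -w for every 0 bit, column by column."""
    sums = [0] * f
    for h, w in hashed:
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1  # column 0 is the most significant bit
            sums[i] += w if bit else -w
    return sums

def fingerprint_from_sums(sums):
    """Take the sign of each column: positive gives 1, anything else gives 0."""
    fp = 0
    for s in sums:
        fp = (fp << 1) | (1 if s > 0 else 0)
    return fp

def simhash(weighted_features, f=64):
    """Fingerprint a document given as a {feature: weight} mapping."""
    hashed = [(feature_hash(feat, f), w) for feat, w in weighted_features.items()]
    return fingerprint_from_sums(signed_sums(hashed, f))
```

Applying `fingerprint_from_sums` to the column sums shown in the figure, `[13, 108, -22, -5, -32, 55]`, gives `0b110001`, the fingerprint on the slide.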

  • Problem Definition

    Input:

      • Set $S$ of $f$-bit fingerprints (document index)
      • Query fingerprint $Q$ (new document)

    Output: whether a near-duplicate exists or not

    Batch mode input: a set of query fingerprints

    Running example: $f = 64$, $k = 3$

    $(Q_1, Q_2)$ near-duplicate $\iff$ hamming-distance(simhash($Q_1$), simhash($Q_2$)) $\le k$
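
The definition translates directly into code. The sketch below assumes fingerprints are plain Python integers; `has_near_duplicate` is a brute-force linear scan, shown only as a baseline for the problem statement, not as one of the algorithms from the talk.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(fp1, fp2, k=3):
    """Two fingerprints are near-duplicates if they differ in at most k bits."""
    return hamming_distance(fp1, fp2) <= k

def has_near_duplicate(S, q, k=3):
    """Brute-force answer to the single-query problem: scan all of S."""
    return any(is_near_duplicate(s, q, k) for s in S)
```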

  • Attempt One

      • Keep the fingerprints in $S$ pre-sorted
      • For a 64-bit query $Q$, generate all $Q'$ with $hd(Q, Q') \le k = 3$
      • Look up each $Q'$ in $S$ with an exact probe
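
A small sketch of this first attempt shows why it is expensive: the number of exact probes per query is the number of bit patterns within $k$ flips. The function names are illustrative, and the exact probe is assumed to be a binary search over a sorted list.

```python
from bisect import bisect_left
from itertools import combinations
from math import comb

def neighbors_within(q, k=3, f=64):
    """Yield every f-bit value whose Hamming distance from q is at most k."""
    for flips in range(k + 1):
        for positions in combinations(range(f), flips):
            candidate = q
            for p in positions:
                candidate ^= 1 << p
            yield candidate

def probe_count(k=3, f=64):
    """How many exact probes one query needs."""
    return sum(comb(f, i) for i in range(k + 1))

def contains(sorted_S, x):
    """Exact probe by binary search."""
    i = bisect_left(sorted_S, x)
    return i < len(sorted_S) and sorted_S[i] == x

def attempt_one(sorted_S, q, k=3, f=64):
    """True if some fingerprint in sorted_S is within k bits of q."""
    return any(contains(sorted_S, c) for c in neighbors_within(q, k, f))
```

For $f = 64$ and $k = 3$, `probe_count()` is 1 + 64 + 2016 + 41664 = 43745 probes for every query.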

  • Attempt Two

      • Start from the fingerprints in $S$
      • Build $S'$: all fingerprints at most $k$ bits away from some fingerprint in $S$, and sort it
      • Look up the 64-bit query $Q$ in $S'$ with an exact probe
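
The second attempt moves the enumeration from query time to build time. The sketch below uses a toy fingerprint width so that it runs quickly; the names are illustrative. With $f = 64$ and $k = 3$ each stored fingerprint would expand to as many as 43745 entries, which is the space cost of this approach.

```python
from bisect import bisect_left
from itertools import combinations

def expand(S, k, f):
    """Build S': every f-bit value within k bits of some member of S, sorted."""
    expanded = set()
    for s in S:
        for flips in range(k + 1):
            for positions in combinations(range(f), flips):
                v = s
                for p in positions:
                    v ^= 1 << p
                expanded.add(v)
    return sorted(expanded)

def attempt_two(sorted_expanded, q):
    """A single exact probe of the precomputed table answers the query."""
    i = bisect_left(sorted_expanded, q)
    return i < len(sorted_expanded) and sorted_expanded[i] == q
```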

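  • Intuition for Our Approach

    Observation 1: consider $2^d$ $f$-bit fingerprints in sorted order

      • Most of the $2^d$ combinations of the $d$ most significant bits exist
      • Can quickly do an exact probe on the first $d'$ ($d' \le d$) bits

    Observation 2: [Figure: $Q$ and $Q'$ with $hd(Q, Q') = 3$.] The three differing bits fall in at most three blocks, so when the fingerprint is split into more than three blocks, at least one block of $Q'$ is an exact match for the same block of $Q$.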
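
Observation 2 is a pigeonhole argument, and it can be checked mechanically. The sketch below assumes equal-width blocks (so $r$ must divide $f$); the helper names are made up for this illustration.

```python
from itertools import combinations

def split_blocks(x, f=64, r=4):
    """Split an f-bit value into r equal blocks, most significant first."""
    width = f // r
    mask = (1 << width) - 1
    return [(x >> (f - (i + 1) * width)) & mask for i in range(r)]

def matching_blocks(q, q2, f=64, r=4):
    """Indices of the blocks in which q and q2 agree exactly."""
    pairs = zip(split_blocks(q, f, r), split_blocks(q2, f, r))
    return [i for i, (a, b) in enumerate(pairs) if a == b]

def pigeonhole_holds(f, r, k):
    """Check every way of flipping k bits: does some block always survive?"""
    for positions in combinations(range(f), k):
        flipped = 0
        for p in positions:
            flipped |= 1 << p
        if not matching_blocks(0, flipped, f, r):
            return False
    return True
```

With four blocks and three flipped bits a matching block always exists; with only two blocks it does not, which is why the number of blocks has to exceed $k$.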

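  • Example

    [Figure: each 64-bit fingerprint in $S$ is split into four 16-bit blocks A, B, C, D, and four tables are kept, led by block A, B, C and D respectively. The 64-bit query $Q$ is split the same way into $Q_1, Q_2, Q_3, Q_4$, and each piece is matched against its table with an exact search on 16 bits.]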

  • Example: Analysis

      • 64 bits split into 4 pieces
      • 4 tables with permuted fingerprints
      • Exact search on 16 bits
      • With $2^{34}$ fingerprints (on the order of 10 billion), each probe returns about $2^{34-16} = 2^{18}$ fingerprints
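
A minimal sketch of the four-table scheme follows. It assumes the permutation is a simple 16-bit rotation that brings block $i$ to the front of table $i$, and it uses in-memory sorted lists with binary search; both choices are assumptions made for illustration, and the names are hypothetical.

```python
from bisect import bisect_left

F, BLOCK, R = 64, 16, 4
MASK = (1 << F) - 1

def rotate_left(x, bits):
    """Rotate an F-bit value left; a rotation is one possible permutation."""
    bits %= F
    return ((x << bits) | (x >> (F - bits))) & MASK

def build_tables(S):
    """Table i holds every fingerprint rotated so block i leads, sorted."""
    return [sorted(rotate_left(s, i * BLOCK) for s in S) for i in range(R)]

def query(tables, q, k=3):
    """Return the stored fingerprints within k bits of q (valid for k < R)."""
    found = set()
    for i, table in enumerate(tables):
        rq = rotate_left(q, i * BLOCK)
        prefix = rq >> (F - BLOCK)
        # All entries sharing the leading 16 bits form one contiguous run.
        lo = bisect_left(table, prefix << (F - BLOCK))
        hi = bisect_left(table, (prefix + 1) << (F - BLOCK))
        for cand in table[lo:hi]:
            if bin(cand ^ rq).count("1") <= k:
                found.add(rotate_left(cand, F - i * BLOCK))  # undo rotation
    return found
```

Each query costs four binary searches plus a Hamming-distance check on the candidates in each run, which is where the figure of about $2^{18}$ candidates per probe matters.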

  • Analysis (contd.)

      • $f$ bits split into $r$ pieces
      • $\binom{r}{k}$ tables with permuted fingerprints
      • Exact search on $f(1 - k/r)$ bits
      • With $2^d$ existing fingerprints, each probe yields $2^{d - f(1 - k/r)}$ fingerprints

    For $f = 64$, $k = 3$, $d = 34$:

      r    #tables    #matches per probe
      4    4          $2^{18}$
      5    10         $2^{6}$
      6    20         $8$
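
The two formulas on this slide can be evaluated directly. The sketch below treats block sizes as real numbers, exactly as the formulas do, so for values of $r$ that do not divide $f$ the result is fractional and need not coincide with figures obtained from whole-bit block sizes. The function name is illustrative.

```python
from math import comb

def table_design(f, k, d, r):
    """Return (#tables, bits searched exactly, log2 of matches per probe)."""
    tables = comb(r, k)             # choose which k blocks may hold the differing bits
    probe_bits = f * (1 - k / r)    # the remaining r - k blocks are matched exactly
    log2_matches = d - probe_bits   # 2^d fingerprints spread over 2^probe_bits prefixes
    return tables, probe_bits, log2_matches
```

For example, `table_design(64, 3, 34, 4)` returns `(4, 16.0, 18.0)`: four tables, a 16-bit exact search, and about $2^{18}$ matches per probe.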

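  • Same Idea Recursively

    [Figure: the 64-bit fingerprints in $S$ are split into 16-bit blocks A, B, C, D as before. Within one top-level table (shown for C), the remaining 48 bits are split again into 12-bit blocks, giving sub-tables $C_1, C_2, C_3, C_4$. Each sub-table leads with the 16-bit block, then one 12-bit block, followed by the remaining 36 bits.]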

  • General Solution

      • Space (#tables) / time (#matches) tradeoff
      • Minimum number of tables, with at most $2^{\tau}$ matches per probe?
      • General solution, where $X(f, k, d)$ is the number of tables:

        $$X(f, k, d) = \begin{cases} 1 & \text{if } d < \tau \\ \min_{r > k} \binom{r}{k} \, X\!\left(\frac{fk}{r},\; k,\; d - \frac{(r-k)f}{r}\right) & \text{otherwise} \end{cases}$$
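
The recurrence can be evaluated with a short memoized function. This is a sketch under stated assumptions: the threshold is passed in as `tau`, block sizes are kept as exact fractions rather than rounded to whole bits, and $r$ is capped at the number of remaining bits so that each block has at least one bit. The cap and the names are choices made for this example.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, inf

def min_tables(f, k, d, tau):
    """Evaluate X(f, k, d): fewest tables so a probe returns under 2**tau matches."""
    @lru_cache(maxsize=None)
    def X(f, d):
        if d < tau:
            return 1  # few enough candidates: one table suffices
        best = inf
        # Assumption: at most one block per remaining bit.
        for r in range(k + 1, int(f) + 1):
            sub = X(f * k / r, d - (r - k) * f / r)
            best = min(best, comb(r, k) * sub)
        return best
    return X(Fraction(f), Fraction(d))
```

With $f = 64$, $k = 3$, $d = 34$, loosening the threshold trades probe cost for space: `min_tables(64, 3, 34, 19)` is 4, `min_tables(64, 3, 34, 10)` is 10, and `min_tables(64, 3, 34, 3)` is 20.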

  • Compression of Tables

      • We can compress the tables efficiently
      • In expectation, the first $d$ bits are common in successive fingerprints
      • Exploit this to compress each of the tables
      • Details in the paper
      • Brings down space requirements by nearly 50%
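
The premise behind the compression can be checked empirically. The sketch below does not reproduce the encoding from the paper; it only measures how many leading bits successive entries share in a sorted table of $2^d$ random fingerprints. The random data, the seed, and the function names are assumptions of this illustration.

```python
import random

def common_prefix_len(a, b, f=64):
    """Number of leading bits (out of f) shared by a and b."""
    return f - (a ^ b).bit_length()

def average_shared_prefix(d, f=64, seed=0):
    """Mean shared-prefix length between neighbours in a sorted random table."""
    rng = random.Random(seed)
    table = sorted(rng.getrandbits(f) for _ in range(2 ** d))
    shared = [common_prefix_len(x, y, f) for x, y in zip(table, table[1:])]
    return sum(shared) / len(shared)
```

For a table of $2^{12}$ random 64-bit values the average comes out within a bit or two of $d = 12$, so a scheme that stores only the bits after the shared prefix has roughly $d$ bits per entry to save.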

  • Rest of Talk

      • Simhash overview
      • Formal definition of the problem
      • Single algorithm
      • Batch algorithm
      • Experiments
      • Conclusions

  • Reminder: Batch Problem

      • Tens of billions of pages indexed
      • Crawl millions of pages each day
      • Quickly find all new pages that have a near-duplicate in the index

  • MapReduce Framework

      • MapReduce framework used within Google; massively parallel
      • Map phase: operate individually on a set of objects
      • Reduce phase: aggregate the results of the mapped objects

  • Batch Algorithm

      • Suppose 8B existing fingerprints (~32GB after compression): file F
      • 1M batch query fingerprints (~8MB): file B
      • F stored in a GFS file system
          • chunked into roughly 64MB pieces
          • replicated at 3 random nodes
      • B stored with a much higher replication factor
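
The sizes quoted on this slide follow from 8 bytes per 64-bit fingerprint. The arithmetic below assumes decimal units (1 GB = 10^9 bytes) and takes the "nearly 50%" compression figure as exactly one half; both are simplifications for illustration.

```python
FINGERPRINT_BYTES = 8                 # one 64-bit fingerprint
EXISTING = 8 * 10**9                  # 8B fingerprints in file F
BATCH = 10**6                         # 1M query fingerprints in file B
CHUNK_BYTES = 64 * 10**6              # roughly 64MB per GFS chunk

raw_f_bytes = EXISTING * FINGERPRINT_BYTES        # 64 GB uncompressed
compressed_f_bytes = raw_f_bytes // 2             # about 32 GB after compression
b_bytes = BATCH * FINGERPRINT_BYTES               # 8 MB
num_chunks = compressed_f_bytes // CHUNK_BYTES    # chunks of F to scan

print(raw_f_bytes, compressed_f_bytes, b_bytes, num_chunks)
```

Under these assumptions F splits into about 500 chunks, each of which can be scanned independently against the much smaller file B.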

  • Batch Algorithm (continued)

    Map phase: duplicate detection between each chunk $F_i$ and the whole of B

      • Build multiple tables for B (in memory)
      • Scan $F_i$ and probe into B
      • Output near-duplicates in B

    Reduce phase: merge outputs
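
The map and reduce phases can be sketched in plain Python, without any MapReduce library. The sketch assumes 64-bit fingerprints split into four 16-bit blocks and swaps the sorted, permuted tables for in-memory dictionaries keyed by block value, which is a simplification chosen for brevity; all names are hypothetical.

```python
K = 3  # with four blocks, the pigeonhole argument needs k <= 3

def blocks(fp):
    """The four 16-bit blocks of a 64-bit fingerprint, most significant first."""
    return [(fp >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]

def build_batch_tables(B):
    """One in-memory index per block position, keyed by that block's bits."""
    tables = [dict() for _ in range(4)]
    for q in B:
        for i, blk in enumerate(blocks(q)):
            tables[i].setdefault(blk, []).append(q)
    return tables

def map_chunk(chunk, tables, k=K):
    """Scan one chunk F_i; return batch fingerprints with a near-duplicate in it."""
    hits = set()
    for fp in chunk:
        for i, blk in enumerate(blocks(fp)):
            for q in tables[i].get(blk, ()):
                if bin(fp ^ q).count("1") <= k:
                    hits.add(q)
    return hits

def reduce_outputs(outputs):
    """Merge the per-chunk outputs into one set."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged

def batch_near_duplicates(chunks, B, k=K):
    tables = build_batch_tables(B)
    return reduce_outputs(map_chunk(c, tables, k) for c in chunks)
```

Because the tables are built over the small file B rather than over F, every mapper can hold them in memory and stream through its chunk once.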

  • Batch Algorithm (continued)

    [Figure: each chunk $F_1, F_2, \ldots, F_n$ is paired with $B_1$ and $B_2$, and the per-chunk outputs are merged.]

  • Experimental Analysis

      • Promising preliminary results!
      • Studied:
          • Choice of simhash parameters
          • Distribution of fingerprints

  • Choice of Simhash Parameters

  • Distribution of Fingerprints

  • Distribution of Fingerprints (continued)

  • Summary

      • Addressed near-duplicate detection in a web-crawling system
      • Proposed algorithms for the single and batch cases
      • Preliminary experiments to validate our techniques and the suitability of simhash
      • Mini-survey of near-duplicate detection in the paper

  • Thank you!