12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4...

Preview:

Citation preview

5/12/17

1

Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

High dim. data

Localitysensitivehashing

Clustering

Dimensionality

reduction

Graph data

PageRank,SimRank

NetworkAnalysis

SpamDetection

Infinite data

Filteringdata

streams

Webadvertising

Queriesonstreams

Machine learning

SVM

DecisionTrees

Perceptron,kNN

Apps

Recommendersystems

AssociationRules

Duplicatedocumentdetection

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 2

5/12/17

2

3J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

[Hays and Efros, SIGGRAPH 2007]

4J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

[Hays and Efros, SIGGRAPH 2007]

5/12/17

3

10 nearest neighbors from a collection of 20,000 imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 5

[Hays and Efros, SIGGRAPH 2007]

10 nearest neighbors from a collection of 2 million imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 6

[Hays and Efros, SIGGRAPH 2007]

5/12/17

4

¡ Manyproblemscanbeexpressedasfinding“similar”sets:§ Findnear-neighborsinhigh-dimensional space

¡ Examples:§ Pageswithsimilarwords

§ Forduplicatedetection,classificationbytopic§ Customerswhopurchasedsimilarproducts

§ Productswithsimilarcustomersets§ Imageswithsimilarfeatures

§ Userswhovisitedsimilarwebsites

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 7

¡ Given:Highdimensionaldatapoints𝒙𝟏, 𝒙𝟐, …§ Forexample: Imageisalongvectorofpixelcolors

1 2 10 2 10 1 0

→ [121021010]

¡ Andsomedistancefunction𝒅(𝒙𝟏, 𝒙𝟐)§ Whichquantifiesthe“distance”between𝒙𝟏 and𝒙𝟐

¡ Goal: Findallpairsofdatapoints(𝒙𝒊, 𝒙𝒋) thatarewithinsomedistancethreshold𝒅 𝒙𝒊, 𝒙𝒋 ≤ 𝒔

¡ Note: Naïvesolutionwouldtake𝑶 𝑵𝟐 Lwhere𝑵 isthenumberofdatapoints

¡ MAGIC:Thiscanbedonein𝑶 𝑵 !!How?J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 8

5/12/17

5

¡ Lasttime: Findingfrequentpairs

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 9

Item

s 1…

N

Items 1…N

Count of pair {i,j} in the data

Naïve solution:Single pass but requires space quadratic in the number of items

Item

s 1…

K

Items 1…K

Count of pair {i,j}in the data

A-Priori:First pass: Find frequent singletonsFor a pair to be a frequent pair candidate, its singletons have to be frequent!Second pass:Count only candidate pairs!

N … number of distinct itemsK … number of items with support ³ s

¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:

§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 10

Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}

Buckets 1…B2 1

5/12/17

6

¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:

§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:

§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletons{i},{j}havetobefrequentandthepairhastohashtoafrequentbucket!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 11

Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}

Buckets 1…B3 1 2

¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:

§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:

§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletonshavetobefrequentanditshastohashtoafrequentbucket!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 12

Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}

Buckets 1…B3 1 2

Previous lecture: A-PrioriMain idea: CandidatesInstead of keeping a count of each pair, only keep a count of candidate pairs!

Today’s lecture: Find pairs of similar docsMain idea: Candidates-- Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket-- Pass 2: Only compare documents that are candidates (i.e., they hashed to a same bucket)Benefits: Instead of O(N2) comparisons, we need O(N) comparisons to find similar documents

5/12/17

7

¡ Goal: Findnear-neighborsinhigh-dim.space§ Weformallydefine“nearneighbors”aspointsthatarea“smalldistance”apart

¡ Foreachapplication,wefirstneedtodefinewhat“distance”means

¡ Today:Jaccard distance/similarity§ TheJaccard similarity oftwosets isthesizeoftheirintersectiondividedbythesizeoftheirunion:sim(C1,C2)=|C1ÇC2|/|C1ÈC2|

§ Jaccard distance: d(C1,C2)=1- |C1ÇC2|/|C1ÈC2|

14J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

3 in intersection8 in unionJaccard similarity= 3/8Jaccard distance = 5/8

5/12/17

8

¡ Goal: Givenalargenumber(𝑵 inthemillionsorbillions)ofdocuments,find“nearduplicate”pairs

¡ Applications:§ Mirrorwebsites,orapproximatemirrors

§ Don’twanttoshowbothinsearchresults§ Similarnewsarticlesatmanynewssites

§ Clusterarticlesby“samestory”¡ Problems:

§ Manysmallpiecesofonedocumentcanappearoutoforderinanother

§ Toomanydocumentstocompareallpairs§ Documentsaresolargeorsomanythattheycannotfitinmainmemory

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 15

1. Shingling: Convertdocumentstosets

2. Min-Hashing: Convertlargesetstoshortsignatures,whilepreservingsimilarity

3. Locality-SensitiveHashing: Focusonpairsofsignatureslikelytobefromsimilardocuments

§ Candidatepairs!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 16

5/12/17

9

17

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity

Locality-SensitiveHashing

Candidatepairs:those pairsof signaturesthat we needto test forsimilarity

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Step1: Shingling: Convertdocumentstosets

Docu-ment

The setof stringsof length kthat appearin the doc-ument

5/12/17

10

¡ Step1: Shingling: Convertdocumentstosets

¡ Simpleapproaches:§ Document=setofwordsappearingindocument§ Document=setof“important”words§ Don’tworkwellforthisapplication.Why?

¡ Needtoaccountfororderingofwords!¡ Adifferentway:Shingles!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 19

¡ Ak-shingle (ork-gram)foradocumentisasequenceofktokensthatappearsinthedoc§ Tokenscanbecharacters,wordsorsomethingelse,dependingontheapplication

§ Assumetokens=charactersforexamples

¡ Example: k=2;documentD1=abcabSetof2-shingles:S(D1)={ab,bc,ca}§ Option: Shinglesasabag(multiset),countabtwice:S’(D1)={ab,bc,ca, ab}

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 20

5/12/17

11

¡ Tocompresslongshingles,wecanhash themto(say)4bytes

¡ Representadocumentbythesetofhashvaluesofitsk-shingles§ Idea: Twodocumentscould(rarely)appeartohaveshinglesincommon,wheninfactonlythehash-valueswereshared

¡ Example: k=2;documentD1= abcabSetof2-shingles:S(D1)={ab,bc,ca}Hashthesingles:h(D1)={1,5,7}

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 21

¡ DocumentD1isasetofitsk-shinglesC1=S(D1)¡ Equivalently,eachdocumentisa0/1vectorinthespaceofk-shingles§ Eachuniqueshingleisadimension§ Vectorsareverysparse

¡ AnaturalsimilaritymeasureistheJaccard similarity:

sim(D1,D2)=|C1ÇC2|/|C1ÈC2|

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 22

5/12/17

12

¡ Documentsthathavelotsofshinglesincommonhavesimilartext,evenifthetextappearsindifferentorder

¡ Caveat: Youmustpickk largeenough,ormostdocumentswillhavemostshingles§ k =5isOKforshortdocuments§ k =10isbetterforlongdocuments

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 23

¡ Supposeweneedtofindnear-duplicatedocumentsamong𝑵 = 𝟏milliondocuments

¡ Naïvely,wewouldhavetocomputepairwiseJaccard similaritiesforeverypairofdocs§ 𝑵(𝑵 − 𝟏)/𝟐 ≈5*1011 comparisons§ At105 secs/dayand106 comparisons/sec,itwouldtake5days

¡ For𝑵 = 𝟏𝟎million,ittakesmorethanayear…

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 24

5/12/17

13

Step2:Minhashing: Convertlargesets toshortsignatures,whilepreservingsimilarity

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity

¡ Manysimilarityproblemscanbeformalizedasfindingsubsetsthathavesignificantintersection

¡ Encodesetsusing0/1(bit,boolean)vectors§ Onedimensionperelementintheuniversalset

¡ InterpretsetintersectionasbitwiseAND,andsetunionasbitwiseOR

¡ Example: C1 =10111;C2 =10011§ Sizeofintersection=3;sizeofunion=4,§ Jaccard similarity (notdistance)=3/4§ Distance:d(C1,C2)=1– (Jaccard similarity)=1/4

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 26

5/12/17

14

¡ Rows =elements(shingles)¡ Columns =sets(documents)§ 1inrowe andcolumns ifandonlyife isamemberofs

§ ColumnsimilarityistheJaccardsimilarityofthecorrespondingsets(rowswithvalue1)

§ Typicalmatrixissparse!¡ Eachdocumentisacolumn:

§ Example: sim(C1 ,C2)=?§ Sizeofintersection=3;sizeofunion=6,Jaccard similarity(notdistance)=3/6

§ d(C1,C2)=1– (Jaccard similarity)=3/627J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

0101

0111

1001

1000

10101011

0111 Documents

Shin

gles

¡ Wemightnotreallyrepresentthedatabyabooleanmatrix.

¡ Sparsematricesareusuallybetterrepresentedbythelistofplaceswherethereisanon-zerovalue.

¡ Butthematrixpictureisconceptuallyuseful.

28

5/12/17

15

1. Whenthesetsaresolargeorsomanythattheycannotfitinmainmemory.

2. Or,whentherearesomanysetsthatcomparingallpairsofsetstakestoomuchtime.

3. Orboth.

29

¡ Sofar:§ Documents® Setsofshingles§ Representsetsasboolean vectorsinamatrix

¡ Nextgoal:Findsimilarcolumnswhilecomputingsmallsignatures§ Similarityofcolumns==similarityofsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 30

5/12/17

16

¡ NextGoal:Findsimilarcolumns,Smallsignatures¡ Naïveapproach:

§ 1)Signaturesofcolumns: smallsummariesofcolumns§ 2)Examinepairsofsignatures tofindsimilarcolumns

§ Essential: Similaritiesofsignaturesandcolumnsarerelated

§ 3)Optional: Checkthatcolumnswithsimilarsignaturesarereallysimilar

¡ Warnings:§ Comparingallpairsmaytaketoomuchtime:JobforLSH

§ Thesemethodscanproducefalsenegatives,andevenfalsepositives(iftheoptionalcheckisnotmade)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 31

¡ Keyidea: “hash”eachcolumnC toasmallsignature h(C),suchthat:§ (1) h(C) issmallenoughthatthesignaturefitsinRAM§ (2) sim(C1,C2) isthesameasthe“similarity”ofsignaturesh(C1)andh(C2)

¡ Goal: Findahashfunctionh(·) suchthat:§ Ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ Ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)

¡ Hashdocsintobuckets.Expectthat“most”pairsofnearduplicatedocshashintothesamebucket!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 32

5/12/17

17

¡ Goal: Findahashfunctionh(·) suchthat:§ ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)

¡ Clearly,thehashfunctiondependsonthesimilaritymetric:§ Notallsimilaritymetricshaveasuitablehashfunction

¡ ThereisasuitablehashfunctionfortheJaccard similarity: ItiscalledMin-Hashing

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 33

34

¡ Imaginetherowsoftheboolean matrixpermutedunderrandompermutationp

¡ Definea“hash”functionhp(C) =theindexofthefirst (inthepermutedorderp)rowinwhichcolumnC hasvalue1:

hp (C) = minp p(C)

¡ Useseveral(e.g.,100)independenthashfunctions(thatis,permutations)tocreateasignatureofacolumn

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

5/12/17

18

35

3

4

7

2

6

1

5

Signature matrix M

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

2nd element of the permutation is the first to map to a 1

4th element of the permutation is the first to map to a 1

0101

0101

1010

1010

1010

1001

0101

Input matrix (Shingles x Documents) Permutation p

Note: Another (equivalent) way is to store row indexes: 1 5 1 5

2 3 1 36 4 6 4

¡ Choosearandompermutationp¡ Claim: Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Why?

§ LetX beadoc(setofshingles),yÎ X isashingle§ Then: Pr[p(y)=min(p(X))]=1/|X|

§ ItisequallylikelythatanyyÎ X ismappedtothemin element

§ Lety bes.t. p(y)=min(p(C1ÈC2))§ Theneither: p(y)=min(p(C1))ifyÎ C1,or

p(y)=min(p(C2))ifyÎ C2§ Sotheprob.thatboth aretrueistheprob.y Î C1Ç C2§ Pr[min(p(C1))=min(p(C2))]=|C1ÇC2|/|C1ÈC2|=sim(C1,C2)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 36

01

10

00

11

00

00

One of the twocols had to have1 at position y

5/12/17

19

¡ GivencolsC1 andC2,rowsmaybeclassifiedas:C1 C2

A 1 1B 1 0C 0 1D 0 0

§ a =#rowsoftypeA,etc.¡ Note: sim(C1,C2)=a/(a+b+c)¡ Then: Pr[h(C1)=h(C2)]=Sim(C1,C2)

§ LookdownthecolsC1 andC2 untilweseea1§ Ifit’satype-A row,thenh(C1)=h(C2)Ifatype-B ortype-C row,thennot

37J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

38

¡ Weknow:Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Nowgeneralizetomultiplehashfunctions

¡ Thesimilarityoftwosignaturesisthefractionofthehashfunctionsinwhichtheyagree

¡ Note: BecauseoftheMin-Hashproperty,thesimilarityofcolumnsisthesameastheexpectedsimilarityoftheirsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

5/12/17

20

39J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Similarities:1-3 2-4 1-2 3-4

Col/Col 0.75 0.75 0 0Sig/Sig 0.67 1.00 0 0

Signature matrix M

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121

0101

0101

1010

1010

1010

1001

0101

Input matrix (Shingles x Documents)

3

4

7

2

6

1

5

Permutation p

¡ PickK=100randompermutationsoftherows¡ Thinkofsig(C) asacolumnvector¡ sig(C)[i]= accordingtothei-th permutation,theindexofthefirstrowthathasa1incolumnCsig(C)[i] = min (pi(C))

¡ Note: Thesketch(signature)ofdocumentC issmall~𝟏𝟎𝟎 bytes!

¡ Weachievedourgoal! We“compressed”longbitvectorsintoshortsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 40

5/12/17

21

¡ Permutingrowsevenonceisprohibitive¡ Rowhashing!

§ PickK=100 hashfunctionski§ Orderingunderki givesarandomrowpermutation!

¡ One-passimplementation§ ForeachcolumnC andhash-func.ki keepa“slot”forthemin-hashvalue

§ Initializeallsig(C)[i]=¥§ Scanrowslookingfor1s

§ Supposerowj has1incolumnC§ Thenforeachki :

§ Ifki(j)<sig(C)[i],thensig(C)[i]¬ ki(j)J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 41

How to pick a randomhash function h(x)?Universal hashing:ha,b(x)=((a·x+b) mod p) mod Nwhere:a,b … random integersp … prime number (p > N)

¡ Suppose1billionrows.

¡ Hardtopickarandompermutationfrom1…billion.

¡ Representingarandompermutationrequires1billionentries.

¡ Accessingrowsinpermutedorderleadstothrashing.

42

5/12/17

22

¡ Agoodapproximationtopermutingrows:pick100(?)hashfunctions.

¡ Foreachcolumnc andeachhashfunctionhi,keepa“slot” M(i,c).

¡ Intent:M(i,c)willbecomethesmallestvalueofhi(r )forwhichcolumnc has1inrowr.§ I.e.,hi(r )givesorderofrowsfor i thpermuation.

43

Minhashing

InitializeM(i,c)to∞foralliandcfor eachrowrfor eachcolumnc

if chas1inrowrfor eachhashfunctionhi do

if hi(r)isasmallervaluethanM(i,c)then

M(i,c):=hi(r );

44

Minhashing

5/12/17

23

45

Row C1 C21 1 02 0 13 1 14 1 05 0 1

h(x) = x mod 5g(x) = 2x+1 mod 5

Sig1 Sig2

Minhashing

h(1) = 1 1 -g(1) = 3 3 -h(2) = 2 1 2g(2) = 0 3 0

h(3) = 3 1 2g(3) = 2 2 0h(4) = 4 1 2g(4) = 4 2 0h(5) = 0 1 0g(5) = 1 2 0

¡ Often,dataisgivenbycolumn,notrow.§ E.g.,columns=documents,rows=shingles.

¡ Ifso,sortmatrixoncesoitisbyrow.

¡ Andalways computehi(r)onlyonceforeachrow.

46

Minhashing

5/12/17

24

Step3:Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity

Locality-SensitiveHashing

Candidatepairs:those pairsof signaturesthat we needto test forsimilarity

¡ Supposewehave,inmainmemory,datarepresentingalargenumberofobjects.§ Maybetheobjectsthemselves.§ Maybesignaturesasinminhashing.

¡ Wewanttocompareeachtoeach,findingthosepairs thataresufficientlysimilar.

48

5/12/17

25

¡ Whilethesignatures ofallcolumnsmayfitinmainmemory,comparingthesignaturesofallpairs ofcolumnsisquadratic inthenumberofcolumns.

¡ Example:106 columnsimplies5*1011 column-comparisons.

¡ At1microsecond/comparison:6days.

49

¡ Goal:FinddocumentswithJaccard similarityatleasts (forsomesimilaritythreshold,e.g., s=0.8)

¡ LSH– Generalidea: Useafunctionf(x,y) thattellswhetherx andy isacandidatepair: apairofelementswhosesimilaritymustbeevaluated

¡ ForMin-Hashmatrices:§ HashcolumnsofsignaturematrixM tomanybuckets§ Eachpairofdocumentsthathashesintothesamebucketisacandidatepair

50J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

1212

1412

2121

5/12/17

26

¡ Pickasimilaritythresholds (0<s<1)

¡ Columnsx andy ofM areacandidatepair iftheirsignaturesagreeonatleastfractions oftheirrows:M (i,x)=M (i,y) foratleastfrac.s valuesofi§ Weexpectdocumentsx andy tohavethesame(Jaccard)similarityastheirsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 51

1212

1412

2121

¡ Bigidea: HashcolumnsofsignaturematrixM severaltimes

¡ Arrangethat(only)similarcolumns arelikelytohashtothesamebucket,withhighprobability

¡ Candidatepairsarethosethathashtothesamebucket

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 52

1212

1412

2121

5/12/17

27

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 53

Signature matrix M

r rowsper band

b bands

Onesignature

1212

1412

2121

¡ DividematrixM intob bandsofr rows

¡ Foreachband,hashitsportionofeachcolumntoahashtablewithk buckets§ Makek aslargeaspossible

¡ Candidate columnpairsarethosethathashtothesamebucketfor≥ 1 band

¡ Tune b andr tocatchmostsimilarpairs,butfewnon-similarpairs

54J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

5/12/17

28

Matrix M

r rows b bands

BucketsColumns 2 and 6are probably identical (candidate pair)

Columns 6 and 7 aresurely different.

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 55

¡ Thereareenoughbuckets thatcolumnsareunlikelytohashtothesamebucketunlesstheyareidentical inaparticularband

¡ Hereafter,weassumethat“samebucket”means“identicalinthatband”

¡ Assumptionneededonlytosimplifyanalysis,notforcorrectnessofalgorithm

56J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

5/12/17

29

Assumethefollowingcase:¡ Suppose100,000columnsofM (100kdocs)¡ Signaturesof100integers(rows)¡ Therefore,signaturestake40Mb¡ Chooseb =20bandsofr =5integers/band

¡ Goal: Findpairsofdocumentsthatareatleasts =0.8 similar

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 57

1212

1412

2121

¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.8

§ Sincesim(C1,C2)³ s,wewantC1,C2 tobeacandidatepair:Wewantthemtohashtoat least1commonbucket(atleastonebandisidentical)

¡ ProbabilityC1,C2 identicalinoneparticularband:(0.8)5 =0.328

¡ ProbabilityC1,C2 arenot similarinallofthe20bands:(1-0.328)20 =0.00035§ i.e.,about1/3000thofthe80%-similarcolumnpairsarefalsenegatives (wemissthem)

§ Wewouldfind99.965%pairsoftrulysimilardocuments

58J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

1212

1412

2121

5/12/17

30

¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.3

§ Sincesim(C1,C2)<s wewantC1,C2 tohashtoNOcommonbuckets (allbandsshouldbedifferent)

¡ ProbabilityC1,C2 identicalinoneparticularband:(0.3)5 =0.00243

¡ ProbabilityC1,C2 identicalinatleast1of20bands:1- (1- 0.00243)20 =0.0474§ Inotherwords,approximately4.74%pairsofdocswithsimilarity0.3endupbecomingcandidatepairs§ Theyarefalsepositivessincewewillhavetoexaminethem(theyarecandidatepairs)butthenitwillturnouttheirsimilarityisbelowthresholds

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 59

1212

1412

2121

¡ Pick:§ ThenumberofMin-Hashes(rowsofM)§ Thenumberofbandsb,and§ Thenumberofrowsr perbandtobalancefalsepositives/negatives

¡ Example: Ifwehadonly15bandsof5rows,thenumberoffalsepositiveswouldgodown,butthenumberoffalsenegativeswouldgoup

60J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

1212

1412

2121

5/12/17

31

Similarity t =sim(C1, C2) of two sets

Probabilityof sharinga bucket

Sim

ilarit

y th

resh

old

sNo chance

if t < s

Probability = 1 if t > s

61J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 62

Remember:Probability ofequal hash-values= similarity

Similarity t =sim(C1, C2) of two sets

Probabilityof sharinga bucket

5/12/17

32

¡ ColumnsC1 andC2 havesimilarityt¡ Pickanyband(r rows)§ Prob.thatallrowsinbandequal= tr

§ Prob.thatsomerowinbandunequal=1- tr

¡ Prob.thatnobandidentical=(1- tr)b

¡ Prob.thatatleast1bandidentical=1- (1- tr)b

63J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

t r

All rowsof a bandare equal

1 -

Some rowof a bandunequal

( )b

No bandsidentical

1 -

At leastone bandidentical

s ~ (1/b)1/r

64J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Similarity t=sim(C1, C2) of two sets

Probabilityof sharinga bucket

5/12/17

33

¡ Similaritythresholds¡ Prob.thatatleast1bandisidentical:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 65

s 1-(1-sr)b

.2 .006

.3 .047

.4 .186

.5 .470

.6 .802

.7 .975

.8 .9996

¡ Pickingr andb togetthebestS-curve§ 50hash-functions(r=5,b=10)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 66

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Blue area: False Negative rateGreen area: False Positive rate

Similarity

Prob

. sha

ring

a bu

cket

5/12/17

34

¡ TuneM,b,r togetalmostallpairswithsimilarsignatures,buteliminatemostpairsthatdonothavesimilarsignatures

¡ Checkinmainmemorythatcandidatepairsreallydohavesimilarsignatures

¡ Optional: Inanotherpassthroughdata,checkthattheremainingcandidatepairsreallyrepresentsimilardocuments

67J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

¡ Shingling: Convertdocumentstosets§ WeusedhashingtoassigneachshingleanID

¡ Min-Hashing:Convertlargesetstoshortsignatures,whilepreservingsimilarity§ Weusedsimilaritypreservinghashing togeneratesignatureswithpropertyPr[hp(C1)=hp(C2)]=sim(C1,C2)

§ Weusedhashingtogetaroundgeneratingrandompermutations

¡ Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments§ Weusedhashingtofindcandidatepairs ofsimilarity³ s

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 68

Recommended