12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

1

Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

High dim. data

Localitysensitivehashing

Clustering

Dimensionality

reduction

Graph data

PageRank,SimRank

NetworkAnalysis

SpamDetection

Infinite data

Filteringdata

streams

Webadvertising

Queriesonstreams

Machine learning

SVM

DecisionTrees

Perceptron,kNN

Apps

Recommendersystems

AssociationRules

Duplicatedocumentdetection

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 2

5/12/17

2

3J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

[Hays and Efros, SIGGRAPH 2007]



5/12/17

3

10 nearest neighbors from a collection of 20,000 imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 5


10 nearest neighbors from a collection of 2 million imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 6


5/12/17

4

¡ Manyproblemscanbeexpressedasfinding“similar”sets:§ Findnear-neighborsinhigh-dimensional space

¡ Examples:§ Pageswithsimilarwords

§ Forduplicatedetection,classificationbytopic§ Customerswhopurchasedsimilarproducts

§ Productswithsimilarcustomersets§ Imageswithsimilarfeatures

§ Userswhovisitedsimilarwebsites


¡ Given:Highdimensionaldatapoints𝒙𝟏, 𝒙𝟐, …§ Forexample: Imageisalongvectorofpixelcolors

1 2 10 2 10 1 0

→ [121021010]

¡ Andsomedistancefunction𝒅(𝒙𝟏, 𝒙𝟐)§ Whichquantifiesthe“distance”between𝒙𝟏 and𝒙𝟐

¡ Goal: Findallpairsofdatapoints(𝒙𝒊, 𝒙𝒋) thatarewithinsomedistancethreshold𝒅 𝒙𝒊, 𝒙𝒋 ≤ 𝒔

¡ Note: Naïvesolutionwouldtake𝑶 𝑵𝟐 Lwhere𝑵 isthenumberofdatapoints

¡ MAGIC:Thiscanbedonein𝑶 𝑵 !!How?J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 8

5/12/17

5

¡ Lasttime: Findingfrequentpairs


Item

s 1…

N

Items 1…N

Count of pair {i,j} in the data

Naïve solution:Single pass but requires space quadratic in the number of items

Item

s 1…

K

Items 1…K

Count of pair {i,j}in the data

A-Priori:First pass: Find frequent singletonsFor a pair to be a frequent pair candidate, its singletons have to be frequent!Second pass:Count only candidate pairs!

N … number of distinct itemsK … number of items with support ³ s

¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:

§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:


Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}

Buckets 1…B2 1

5/12/17

6



§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletons{i},{j}havetobefrequentandthepairhastohashtoafrequentbucket!


Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}

Buckets 1…B3 1 2



§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletonshavetobefrequentanditshastohashtoafrequentbucket!


Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}

Buckets 1…B3 1 2

Previous lecture: A-PrioriMain idea: CandidatesInstead of keeping a count of each pair, only keep a count of candidate pairs!

Today’s lecture: Find pairs of similar docsMain idea: Candidates-- Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket-- Pass 2: Only compare documents that are candidates (i.e., they hashed to a same bucket)Benefits: Instead of O(N2) comparisons, we need O(N) comparisons to find similar documents

5/12/17

7

¡ Goal: Findnear-neighborsinhigh-dim.space§ Weformallydefine“nearneighbors”aspointsthatarea“smalldistance”apart

¡ Foreachapplication,wefirstneedtodefinewhat“distance”means

¡ Today:Jaccard distance/similarity§ TheJaccard similarity oftwosets isthesizeoftheirintersectiondividedbythesizeoftheirunion:sim(C1,C2)=|C1ÇC2|/|C1ÈC2|

§ Jaccard distance: d(C1,C2)=1- |C1ÇC2|/|C1ÈC2|


3 in intersection8 in unionJaccard similarity= 3/8Jaccard distance = 5/8

5/12/17

8

¡ Goal: Givenalargenumber(𝑵 inthemillionsorbillions)ofdocuments,find“nearduplicate”pairs

¡ Applications:§ Mirrorwebsites,orapproximatemirrors

§ Don’twanttoshowbothinsearchresults§ Similarnewsarticlesatmanynewssites

§ Clusterarticlesby“samestory”¡ Problems:

§ Manysmallpiecesofonedocumentcanappearoutoforderinanother

§ Toomanydocumentstocompareallpairs§ Documentsaresolargeorsomanythattheycannotfitinmainmemory


1. Shingling: Convertdocumentstosets

2. Min-Hashing: Convertlargesetstoshortsignatures,whilepreservingsimilarity

3. Locality-SensitiveHashing: Focusonpairsofsignatureslikelytobefromsimilardocuments

§ Candidatepairs!


5/12/17

9

17

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity

Locality-SensitiveHashing

Candidatepairs:those pairsof signaturesthat we needto test forsimilarity

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Step1: Shingling: Convertdocumentstosets

Docu-ment


5/12/17

10

¡ Step1: Shingling: Convertdocumentstosets

¡ Simpleapproaches:§ Document=setofwordsappearingindocument§ Document=setof“important”words§ Don’tworkwellforthisapplication.Why?

¡ Needtoaccountfororderingofwords!¡ Adifferentway:Shingles!


¡ Ak-shingle (ork-gram)foradocumentisasequenceofktokensthatappearsinthedoc§ Tokenscanbecharacters,wordsorsomethingelse,dependingontheapplication

§ Assumetokens=charactersforexamples

¡ Example: k=2;documentD1=abcabSetof2-shingles:S(D1)={ab,bc,ca}§ Option: Shinglesasabag(multiset),countabtwice:S’(D1)={ab,bc,ca, ab}


5/12/17

11

¡ Tocompresslongshingles,wecanhash themto(say)4bytes

¡ Representadocumentbythesetofhashvaluesofitsk-shingles§ Idea: Twodocumentscould(rarely)appeartohaveshinglesincommon,wheninfactonlythehash-valueswereshared

¡ Example: k=2;documentD1= abcabSetof2-shingles:S(D1)={ab,bc,ca}Hashthesingles:h(D1)={1,5,7}


¡ DocumentD1isasetofitsk-shinglesC1=S(D1)¡ Equivalently,eachdocumentisa0/1vectorinthespaceofk-shingles§ Eachuniqueshingleisadimension§ Vectorsareverysparse

¡ AnaturalsimilaritymeasureistheJaccard similarity:

sim(D1,D2)=|C1ÇC2|/|C1ÈC2|


5/12/17

12

¡ Documentsthathavelotsofshinglesincommonhavesimilartext,evenifthetextappearsindifferentorder

¡ Caveat: Youmustpickk largeenough,ormostdocumentswillhavemostshingles§ k =5isOKforshortdocuments§ k =10isbetterforlongdocuments


¡ Supposeweneedtofindnear-duplicatedocumentsamong𝑵 = 𝟏milliondocuments

¡ Naïvely,wewouldhavetocomputepairwiseJaccard similaritiesforeverypairofdocs§ 𝑵(𝑵 − 𝟏)/𝟐 ≈5*1011 comparisons§ At105 secs/dayand106 comparisons/sec,itwouldtake5days

¡ For𝑵 = 𝟏𝟎million,ittakesmorethanayear…


5/12/17

13

Step2:Minhashing: Convertlargesets toshortsignatures,whilepreservingsimilarity

Docu-ment



¡ Manysimilarityproblemscanbeformalizedasfindingsubsetsthathavesignificantintersection

¡ Encodesetsusing0/1(bit,boolean)vectors§ Onedimensionperelementintheuniversalset

¡ InterpretsetintersectionasbitwiseAND,andsetunionasbitwiseOR

¡ Example: C1 =10111;C2 =10011§ Sizeofintersection=3;sizeofunion=4,§ Jaccard similarity (notdistance)=3/4§ Distance:d(C1,C2)=1– (Jaccard similarity)=1/4


5/12/17

14

¡ Rows =elements(shingles)¡ Columns =sets(documents)§ 1inrowe andcolumns ifandonlyife isamemberofs

§ ColumnsimilarityistheJaccardsimilarityofthecorrespondingsets(rowswithvalue1)

§ Typicalmatrixissparse!¡ Eachdocumentisacolumn:

§ Example: sim(C1 ,C2)=?§ Sizeofintersection=3;sizeofunion=6,Jaccard similarity(notdistance)=3/6

§ d(C1,C2)=1– (Jaccard similarity)=3/627J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

0101

0111

1001

1000

10101011

0111 Documents

Shin

gles

¡ Wemightnotreallyrepresentthedatabyabooleanmatrix.

¡ Sparsematricesareusuallybetterrepresentedbythelistofplaceswherethereisanon-zerovalue.

¡ Butthematrixpictureisconceptuallyuseful.

28

5/12/17

15

1. Whenthesetsaresolargeorsomanythattheycannotfitinmainmemory.

2. Or,whentherearesomanysetsthatcomparingallpairsofsetstakestoomuchtime.

3. Orboth.

29

¡ Sofar:§ Documents® Setsofshingles§ Representsetsasboolean vectorsinamatrix

¡ Nextgoal:Findsimilarcolumnswhilecomputingsmallsignatures§ Similarityofcolumns==similarityofsignatures


5/12/17

16

¡ NextGoal:Findsimilarcolumns,Smallsignatures¡ Naïveapproach:

§ 1)Signaturesofcolumns: smallsummariesofcolumns§ 2)Examinepairsofsignatures tofindsimilarcolumns

§ Essential: Similaritiesofsignaturesandcolumnsarerelated

§ 3)Optional: Checkthatcolumnswithsimilarsignaturesarereallysimilar

¡ Warnings:§ Comparingallpairsmaytaketoomuchtime:JobforLSH

§ Thesemethodscanproducefalsenegatives,andevenfalsepositives(iftheoptionalcheckisnotmade)


¡ Keyidea: “hash”eachcolumnC toasmallsignature h(C),suchthat:§ (1) h(C) issmallenoughthatthesignaturefitsinRAM§ (2) sim(C1,C2) isthesameasthe“similarity”ofsignaturesh(C1)andh(C2)

¡ Goal: Findahashfunctionh(·) suchthat:§ Ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ Ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)

¡ Hashdocsintobuckets.Expectthat“most”pairsofnearduplicatedocshashintothesamebucket!


5/12/17

17

¡ Goal: Findahashfunctionh(·) suchthat:§ ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)

¡ Clearly,thehashfunctiondependsonthesimilaritymetric:§ Notallsimilaritymetricshaveasuitablehashfunction

¡ ThereisasuitablehashfunctionfortheJaccard similarity: ItiscalledMin-Hashing


34

¡ Imaginetherowsoftheboolean matrixpermutedunderrandompermutationp

¡ Definea“hash”functionhp(C) =theindexofthefirst (inthepermutedorderp)rowinwhichcolumnC hasvalue1:

hp (C) = minp p(C)

¡ Useseveral(e.g.,100)independenthashfunctions(thatis,permutations)tocreateasignatureofacolumn


5/12/17

18

35

3

4

7

2

6

1

5

Signature matrix M

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121


2nd element of the permutation is the first to map to a 1

4th element of the permutation is the first to map to a 1

0101

0101

1010

1010

1010

1001

0101

Input matrix (Shingles x Documents) Permutation p

Note: Another (equivalent) way is to store row indexes: 1 5 1 5

2 3 1 36 4 6 4

¡ Choosearandompermutationp¡ Claim: Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Why?

§ LetX beadoc(setofshingles),yÎ X isashingle§ Then: Pr[p(y)=min(p(X))]=1/|X|

§ ItisequallylikelythatanyyÎ X ismappedtothemin element

§ Lety bes.t. p(y)=min(p(C1ÈC2))§ Theneither: p(y)=min(p(C1))ifyÎ C1,or

p(y)=min(p(C2))ifyÎ C2§ Sotheprob.thatboth aretrueistheprob.y Î C1Ç C2§ Pr[min(p(C1))=min(p(C2))]=|C1ÇC2|/|C1ÈC2|=sim(C1,C2)


01

10

00

11

00

00

One of the twocols had to have1 at position y

5/12/17

19

¡ GivencolsC1 andC2,rowsmaybeclassifiedas:C1 C2

A 1 1B 1 0C 0 1D 0 0

§ a =#rowsoftypeA,etc.¡ Note: sim(C1,C2)=a/(a+b+c)¡ Then: Pr[h(C1)=h(C2)]=Sim(C1,C2)

§ LookdownthecolsC1 andC2 untilweseea1§ Ifit’satype-A row,thenh(C1)=h(C2)Ifatype-B ortype-C row,thennot


38

¡ Weknow:Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Nowgeneralizetomultiplehashfunctions

¡ Thesimilarityoftwosignaturesisthefractionofthehashfunctionsinwhichtheyagree

¡ Note: BecauseoftheMin-Hashproperty,thesimilarityofcolumnsisthesameastheexpectedsimilarityoftheirsignatures


5/12/17

20


Similarities:1-3 2-4 1-2 3-4

Col/Col 0.75 0.75 0 0Sig/Sig 0.67 1.00 0 0

Signature matrix M

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121

0101

0101

1010

1010

1010

1001

0101

Input matrix (Shingles x Documents)

3

4

7

2

6

1

5

Permutation p

¡ PickK=100randompermutationsoftherows¡ Thinkofsig(C) asacolumnvector¡ sig(C)[i]= accordingtothei-th permutation,theindexofthefirstrowthathasa1incolumnCsig(C)[i] = min (pi(C))

¡ Note: Thesketch(signature)ofdocumentC issmall~𝟏𝟎𝟎 bytes!

¡ Weachievedourgoal! We“compressed”longbitvectorsintoshortsignatures


5/12/17

21

¡ Permutingrowsevenonceisprohibitive¡ Rowhashing!

§ PickK=100 hashfunctionski§ Orderingunderki givesarandomrowpermutation!

¡ One-passimplementation§ ForeachcolumnC andhash-func.ki keepa“slot”forthemin-hashvalue

§ Initializeallsig(C)[i]=¥§ Scanrowslookingfor1s

§ Supposerowj has1incolumnC§ Thenforeachki :

§ Ifki(j)<sig(C)[i],thensig(C)[i]¬ ki(j)J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 41

How to pick a randomhash function h(x)?Universal hashing:ha,b(x)=((a·x+b) mod p) mod Nwhere:a,b … random integersp … prime number (p > N)

¡ Suppose1billionrows.

¡ Hardtopickarandompermutationfrom1…billion.

¡ Representingarandompermutationrequires1billionentries.

¡ Accessingrowsinpermutedorderleadstothrashing.

42

5/12/17

22

¡ Agoodapproximationtopermutingrows:pick100(?)hashfunctions.

¡ Foreachcolumnc andeachhashfunctionhi,keepa“slot” M(i,c).

¡ Intent:M(i,c)willbecomethesmallestvalueofhi(r )forwhichcolumnc has1inrowr.§ I.e.,hi(r )givesorderofrowsfor i thpermuation.

43

Minhashing

InitializeM(i,c)to∞foralliandcfor eachrowrfor eachcolumnc

if chas1inrowrfor eachhashfunctionhi do

if hi(r)isasmallervaluethanM(i,c)then

M(i,c):=hi(r );

44

Minhashing

5/12/17

23

45

Row C1 C21 1 02 0 13 1 14 1 05 0 1

h(x) = x mod 5g(x) = 2x+1 mod 5

Sig1 Sig2

Minhashing

h(1) = 1 1 -g(1) = 3 3 -h(2) = 2 1 2g(2) = 0 3 0

h(3) = 3 1 2g(3) = 2 2 0h(4) = 4 1 2g(4) = 4 2 0h(5) = 0 1 0g(5) = 1 2 0

¡ Often,dataisgivenbycolumn,notrow.§ E.g.,columns=documents,rows=shingles.

¡ Ifso,sortmatrixoncesoitisbyrow.

¡ Andalways computehi(r)onlyonceforeachrow.

46

Minhashing

5/12/17

24

Step3:Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments

Docu-ment



Locality-SensitiveHashing

Candidatepairs:those pairsof signaturesthat we needto test forsimilarity

¡ Supposewehave,inmainmemory,datarepresentingalargenumberofobjects.§ Maybetheobjectsthemselves.§ Maybesignaturesasinminhashing.

¡ Wewanttocompareeachtoeach,findingthosepairs thataresufficientlysimilar.

48

5/12/17

25

¡ Whilethesignatures ofallcolumnsmayfitinmainmemory,comparingthesignaturesofallpairs ofcolumnsisquadratic inthenumberofcolumns.

¡ Example:106 columnsimplies5*1011 column-comparisons.

¡ At1microsecond/comparison:6days.

49

¡ Goal:FinddocumentswithJaccard similarityatleasts (forsomesimilaritythreshold,e.g., s=0.8)

¡ LSH– Generalidea: Useafunctionf(x,y) thattellswhetherx andy isacandidatepair: apairofelementswhosesimilaritymustbeevaluated

¡ ForMin-Hashmatrices:§ HashcolumnsofsignaturematrixM tomanybuckets§ Eachpairofdocumentsthathashesintothesamebucketisacandidatepair


1212

1412

2121

5/12/17

26

¡ Pickasimilaritythresholds (0<s<1)

¡ Columnsx andy ofM areacandidatepair iftheirsignaturesagreeonatleastfractions oftheirrows:M (i,x)=M (i,y) foratleastfrac.s valuesofi§ Weexpectdocumentsx andy tohavethesame(Jaccard)similarityastheirsignatures


1212

1412

2121

¡ Bigidea: HashcolumnsofsignaturematrixM severaltimes

¡ Arrangethat(only)similarcolumns arelikelytohashtothesamebucket,withhighprobability

¡ Candidatepairsarethosethathashtothesamebucket


1212

1412

2121

5/12/17

27


Signature matrix M

r rowsper band

b bands

Onesignature

1212

1412

2121

¡ DividematrixM intob bandsofr rows

¡ Foreachband,hashitsportionofeachcolumntoahashtablewithk buckets§ Makek aslargeaspossible

¡ Candidate columnpairsarethosethathashtothesamebucketfor≥ 1 band

¡ Tune b andr tocatchmostsimilarpairs,butfewnon-similarpairs


5/12/17

28

Matrix M

r rows b bands

BucketsColumns 2 and 6are probably identical (candidate pair)

Columns 6 and 7 aresurely different.


¡ Thereareenoughbuckets thatcolumnsareunlikelytohashtothesamebucketunlesstheyareidentical inaparticularband

¡ Hereafter,weassumethat“samebucket”means“identicalinthatband”

¡ Assumptionneededonlytosimplifyanalysis,notforcorrectnessofalgorithm


5/12/17

29

Assumethefollowingcase:¡ Suppose100,000columnsofM (100kdocs)¡ Signaturesof100integers(rows)¡ Therefore,signaturestake40Mb¡ Chooseb =20bandsofr =5integers/band

¡ Goal: Findpairsofdocumentsthatareatleasts =0.8 similar


1212

1412

2121

¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.8

§ Sincesim(C1,C2)³ s,wewantC1,C2 tobeacandidatepair:Wewantthemtohashtoat least1commonbucket(atleastonebandisidentical)

¡ ProbabilityC1,C2 identicalinoneparticularband:(0.8)5 =0.328

¡ ProbabilityC1,C2 arenot similarinallofthe20bands:(1-0.328)20 =0.00035§ i.e.,about1/3000thofthe80%-similarcolumnpairsarefalsenegatives (wemissthem)

§ Wewouldfind99.965%pairsoftrulysimilardocuments


1212

1412

2121

5/12/17

30

¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.3

§ Sincesim(C1,C2)<s wewantC1,C2 tohashtoNOcommonbuckets (allbandsshouldbedifferent)

¡ ProbabilityC1,C2 identicalinoneparticularband:(0.3)5 =0.00243

¡ ProbabilityC1,C2 identicalinatleast1of20bands:1- (1- 0.00243)20 =0.0474§ Inotherwords,approximately4.74%pairsofdocswithsimilarity0.3endupbecomingcandidatepairs§ Theyarefalsepositivessincewewillhavetoexaminethem(theyarecandidatepairs)butthenitwillturnouttheirsimilarityisbelowthresholds


1212

1412

2121

¡ Pick:§ ThenumberofMin-Hashes(rowsofM)§ Thenumberofbandsb,and§ Thenumberofrowsr perbandtobalancefalsepositives/negatives

¡ Example: Ifwehadonly15bandsof5rows,thenumberoffalsepositiveswouldgodown,butthenumberoffalsenegativeswouldgoup


1212

1412

2121

5/12/17

31

Similarity t =sim(C1, C2) of two sets

Probabilityof sharinga bucket

Sim

ilarit

y th

resh

old

sNo chance

if t < s

Probability = 1 if t > s



Remember:Probability ofequal hash-values= similarity

Similarity t =sim(C1, C2) of two sets


5/12/17

32

¡ ColumnsC1 andC2 havesimilarityt¡ Pickanyband(r rows)§ Prob.thatallrowsinbandequal= tr

§ Prob.thatsomerowinbandunequal=1- tr

¡ Prob.thatnobandidentical=(1- tr)b

¡ Prob.thatatleast1bandidentical=1- (1- tr)b


t r

All rowsof a bandare equal

1 -

Some rowof a bandunequal

( )b

No bandsidentical

1 -

At leastone bandidentical

s ~ (1/b)1/r


Similarity t=sim(C1, C2) of two sets


5/12/17

33

¡ Similaritythresholds¡ Prob.thatatleast1bandisidentical:


s 1-(1-sr)b

.2 .006

.3 .047

.4 .186

.5 .470

.6 .802

.7 .975

.8 .9996

¡ Pickingr andb togetthebestS-curve§ 50hash-functions(r=5,b=10)


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Blue area: False Negative rateGreen area: False Positive rate

Similarity

Prob

. sha

ring

a bu

cket

5/12/17

34

¡ TuneM,b,r togetalmostallpairswithsimilarsignatures,buteliminatemostpairsthatdonothavesimilarsignatures

¡ Checkinmainmemorythatcandidatepairsreallydohavesimilarsignatures

¡ Optional: Inanotherpassthroughdata,checkthattheremainingcandidatepairsreallyrepresentsimilardocuments


¡ Shingling: Convertdocumentstosets§ WeusedhashingtoassigneachshingleanID

¡ Min-Hashing:Convertlargesetstoshortsignatures,whilepreservingsimilarity§ Weusedsimilaritypreservinghashing togeneratesignatureswithpropertyPr[hp(C1)=hp(C2)]=sim(C1,C2)

§ Weusedhashingtogetaroundgeneratingrandompermutations

¡ Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments§ Weusedhashingtofindcandidatepairs ofsimilarity³ s