34
5/12/17 1 Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org High dim. data Locality sensitive hashing Clustering Dimensional ity reduction Graph data PageRank, SimRank Network Analysis Spam Detection Infinite data Filtering data streams Web advertising Queries on streams Machine learning SVM Decision Trees Perceptron, kNN Apps Recommen der systems Association Rules Duplicate document detection J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2

12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

1

Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

High dim. data

Localitysensitivehashing

Clustering

Dimensionality

reduction

Graph data

PageRank,SimRank

NetworkAnalysis

SpamDetection

Infinite data

Filteringdata

streams

Webadvertising

Queriesonstreams

Machine learning

SVM

DecisionTrees

Perceptron,kNN

Apps

Recommendersystems

AssociationRules

Duplicatedocumentdetection

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 2

Page 2: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

2

3J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

[Hays and Efros, SIGGRAPH 2007]

4J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

[Hays and Efros, SIGGRAPH 2007]

Page 3: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

3

10 nearest neighbors from a collection of 20,000 imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 5

[Hays and Efros, SIGGRAPH 2007]

10 nearest neighbors from a collection of 2 million imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 6

[Hays and Efros, SIGGRAPH 2007]

Page 4: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

4

¡ Manyproblemscanbeexpressedasfinding“similar”sets:§ Findnear-neighborsinhigh-dimensional space

¡ Examples:§ Pageswithsimilarwords

§ Forduplicatedetection,classificationbytopic§ Customerswhopurchasedsimilarproducts

§ Productswithsimilarcustomersets§ Imageswithsimilarfeatures

§ Userswhovisitedsimilarwebsites

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 7

¡ Given:Highdimensionaldatapoints𝒙𝟏, 𝒙𝟐, …§ Forexample: Imageisalongvectorofpixelcolors

1 2 10 2 10 1 0

→ [121021010]

¡ Andsomedistancefunction𝒅(𝒙𝟏, 𝒙𝟐)§ Whichquantifiesthe“distance”between𝒙𝟏 and𝒙𝟐

¡ Goal: Findallpairsofdatapoints(𝒙𝒊, 𝒙𝒋) thatarewithinsomedistancethreshold𝒅 𝒙𝒊, 𝒙𝒋 ≤ 𝒔

¡ Note: Naïvesolutionwouldtake𝑶 𝑵𝟐 Lwhere𝑵 isthenumberofdatapoints

¡ MAGIC:Thiscanbedonein𝑶 𝑵 !!How?J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 8

Page 5: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

5

¡ Lasttime: Findingfrequentpairs

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 9

Item

s 1…

N

Items 1…N

Count of pair {i,j} in the data

Naïve solution:Single pass but requires space quadratic in the number of items

Item

s 1…

K

Items 1…K

Count of pair {i,j}in the data

A-Priori:First pass: Find frequent singletonsFor a pair to be a frequent pair candidate, its singletons have to be frequent!Second pass:Count only candidate pairs!

N … number of distinct itemsK … number of items with support ³ s

¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:

§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 10

Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}

Buckets 1…B2 1

Page 6: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

6

¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:

§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:

§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletons{i},{j}havetobefrequentandthepairhastohashtoafrequentbucket!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 11

Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}

Buckets 1…B3 1 2

¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:

§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:

§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletonshavetobefrequentanditshastohashtoafrequentbucket!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 12

Items 1…N

Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}

Buckets 1…B3 1 2

Previous lecture: A-PrioriMain idea: CandidatesInstead of keeping a count of each pair, only keep a count of candidate pairs!

Today’s lecture: Find pairs of similar docsMain idea: Candidates-- Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket-- Pass 2: Only compare documents that are candidates (i.e., they hashed to a same bucket)Benefits: Instead of O(N2) comparisons, we need O(N) comparisons to find similar documents

Page 7: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

7

¡ Goal: Findnear-neighborsinhigh-dim.space§ Weformallydefine“nearneighbors”aspointsthatarea“smalldistance”apart

¡ Foreachapplication,wefirstneedtodefinewhat“distance”means

¡ Today:Jaccard distance/similarity§ TheJaccard similarity oftwosets isthesizeoftheirintersectiondividedbythesizeoftheirunion:sim(C1,C2)=|C1ÇC2|/|C1ÈC2|

§ Jaccard distance: d(C1,C2)=1- |C1ÇC2|/|C1ÈC2|

14J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

3 in intersection8 in unionJaccard similarity= 3/8Jaccard distance = 5/8

Page 8: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

8

¡ Goal: Givenalargenumber(𝑵 inthemillionsorbillions)ofdocuments,find“nearduplicate”pairs

¡ Applications:§ Mirrorwebsites,orapproximatemirrors

§ Don’twanttoshowbothinsearchresults§ Similarnewsarticlesatmanynewssites

§ Clusterarticlesby“samestory”¡ Problems:

§ Manysmallpiecesofonedocumentcanappearoutoforderinanother

§ Toomanydocumentstocompareallpairs§ Documentsaresolargeorsomanythattheycannotfitinmainmemory

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 15

1. Shingling: Convertdocumentstosets

2. Min-Hashing: Convertlargesetstoshortsignatures,whilepreservingsimilarity

3. Locality-SensitiveHashing: Focusonpairsofsignatureslikelytobefromsimilardocuments

§ Candidatepairs!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 16

Page 9: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

9

17

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity

Locality-SensitiveHashing

Candidatepairs:those pairsof signaturesthat we needto test forsimilarity

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Step1: Shingling: Convertdocumentstosets

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Page 10: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

10

¡ Step1: Shingling: Convertdocumentstosets

¡ Simpleapproaches:§ Document=setofwordsappearingindocument§ Document=setof“important”words§ Don’tworkwellforthisapplication.Why?

¡ Needtoaccountfororderingofwords!¡ Adifferentway:Shingles!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 19

¡ Ak-shingle (ork-gram)foradocumentisasequenceofktokensthatappearsinthedoc§ Tokenscanbecharacters,wordsorsomethingelse,dependingontheapplication

§ Assumetokens=charactersforexamples

¡ Example: k=2;documentD1=abcabSetof2-shingles:S(D1)={ab,bc,ca}§ Option: Shinglesasabag(multiset),countabtwice:S’(D1)={ab,bc,ca, ab}

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 20

Page 11: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

11

¡ Tocompresslongshingles,wecanhash themto(say)4bytes

¡ Representadocumentbythesetofhashvaluesofitsk-shingles§ Idea: Twodocumentscould(rarely)appeartohaveshinglesincommon,wheninfactonlythehash-valueswereshared

¡ Example: k=2;documentD1= abcabSetof2-shingles:S(D1)={ab,bc,ca}Hashthesingles:h(D1)={1,5,7}

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 21

¡ DocumentD1isasetofitsk-shinglesC1=S(D1)¡ Equivalently,eachdocumentisa0/1vectorinthespaceofk-shingles§ Eachuniqueshingleisadimension§ Vectorsareverysparse

¡ AnaturalsimilaritymeasureistheJaccard similarity:

sim(D1,D2)=|C1ÇC2|/|C1ÈC2|

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 22

Page 12: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

12

¡ Documentsthathavelotsofshinglesincommonhavesimilartext,evenifthetextappearsindifferentorder

¡ Caveat: Youmustpickk largeenough,ormostdocumentswillhavemostshingles§ k =5isOKforshortdocuments§ k =10isbetterforlongdocuments

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 23

¡ Supposeweneedtofindnear-duplicatedocumentsamong𝑵 = 𝟏milliondocuments

¡ Naïvely,wewouldhavetocomputepairwiseJaccard similaritiesforeverypairofdocs§ 𝑵(𝑵 − 𝟏)/𝟐 ≈5*1011 comparisons§ At105 secs/dayand106 comparisons/sec,itwouldtake5days

¡ For𝑵 = 𝟏𝟎million,ittakesmorethanayear…

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 24

Page 13: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

13

Step2:Minhashing: Convertlargesets toshortsignatures,whilepreservingsimilarity

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity

¡ Manysimilarityproblemscanbeformalizedasfindingsubsetsthathavesignificantintersection

¡ Encodesetsusing0/1(bit,boolean)vectors§ Onedimensionperelementintheuniversalset

¡ InterpretsetintersectionasbitwiseAND,andsetunionasbitwiseOR

¡ Example: C1 =10111;C2 =10011§ Sizeofintersection=3;sizeofunion=4,§ Jaccard similarity (notdistance)=3/4§ Distance:d(C1,C2)=1– (Jaccard similarity)=1/4

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 26

Page 14: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

14

¡ Rows =elements(shingles)¡ Columns =sets(documents)§ 1inrowe andcolumns ifandonlyife isamemberofs

§ ColumnsimilarityistheJaccardsimilarityofthecorrespondingsets(rowswithvalue1)

§ Typicalmatrixissparse!¡ Eachdocumentisacolumn:

§ Example: sim(C1 ,C2)=?§ Sizeofintersection=3;sizeofunion=6,Jaccard similarity(notdistance)=3/6

§ d(C1,C2)=1– (Jaccard similarity)=3/627J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

0101

0111

1001

1000

10101011

0111 Documents

Shin

gles

¡ Wemightnotreallyrepresentthedatabyabooleanmatrix.

¡ Sparsematricesareusuallybetterrepresentedbythelistofplaceswherethereisanon-zerovalue.

¡ Butthematrixpictureisconceptuallyuseful.

28

Page 15: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

15

1. Whenthesetsaresolargeorsomanythattheycannotfitinmainmemory.

2. Or,whentherearesomanysetsthatcomparingallpairsofsetstakestoomuchtime.

3. Orboth.

29

¡ Sofar:§ Documents® Setsofshingles§ Representsetsasboolean vectorsinamatrix

¡ Nextgoal:Findsimilarcolumnswhilecomputingsmallsignatures§ Similarityofcolumns==similarityofsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 30

Page 16: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

16

¡ NextGoal:Findsimilarcolumns,Smallsignatures¡ Naïveapproach:

§ 1)Signaturesofcolumns: smallsummariesofcolumns§ 2)Examinepairsofsignatures tofindsimilarcolumns

§ Essential: Similaritiesofsignaturesandcolumnsarerelated

§ 3)Optional: Checkthatcolumnswithsimilarsignaturesarereallysimilar

¡ Warnings:§ Comparingallpairsmaytaketoomuchtime:JobforLSH

§ Thesemethodscanproducefalsenegatives,andevenfalsepositives(iftheoptionalcheckisnotmade)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 31

¡ Keyidea: “hash”eachcolumnC toasmallsignature h(C),suchthat:§ (1) h(C) issmallenoughthatthesignaturefitsinRAM§ (2) sim(C1,C2) isthesameasthe“similarity”ofsignaturesh(C1)andh(C2)

¡ Goal: Findahashfunctionh(·) suchthat:§ Ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ Ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)

¡ Hashdocsintobuckets.Expectthat“most”pairsofnearduplicatedocshashintothesamebucket!

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 32

Page 17: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

17

¡ Goal: Findahashfunctionh(·) suchthat:§ ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)

¡ Clearly,thehashfunctiondependsonthesimilaritymetric:§ Notallsimilaritymetricshaveasuitablehashfunction

¡ ThereisasuitablehashfunctionfortheJaccard similarity: ItiscalledMin-Hashing

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 33

34

¡ Imaginetherowsoftheboolean matrixpermutedunderrandompermutationp

¡ Definea“hash”functionhp(C) =theindexofthefirst (inthepermutedorderp)rowinwhichcolumnC hasvalue1:

hp (C) = minp p(C)

¡ Useseveral(e.g.,100)independenthashfunctions(thatis,permutations)tocreateasignatureofacolumn

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Page 18: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

18

35

3

4

7

2

6

1

5

Signature matrix M

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

2nd element of the permutation is the first to map to a 1

4th element of the permutation is the first to map to a 1

0101

0101

1010

1010

1010

1001

0101

Input matrix (Shingles x Documents) Permutation p

Note: Another (equivalent) way is to store row indexes: 1 5 1 5

2 3 1 36 4 6 4

¡ Choosearandompermutationp¡ Claim: Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Why?

§ LetX beadoc(setofshingles),yÎ X isashingle§ Then: Pr[p(y)=min(p(X))]=1/|X|

§ ItisequallylikelythatanyyÎ X ismappedtothemin element

§ Lety bes.t. p(y)=min(p(C1ÈC2))§ Theneither: p(y)=min(p(C1))ifyÎ C1,or

p(y)=min(p(C2))ifyÎ C2§ Sotheprob.thatboth aretrueistheprob.y Î C1Ç C2§ Pr[min(p(C1))=min(p(C2))]=|C1ÇC2|/|C1ÈC2|=sim(C1,C2)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 36

01

10

00

11

00

00

One of the twocols had to have1 at position y

Page 19: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

19

¡ GivencolsC1 andC2,rowsmaybeclassifiedas:C1 C2

A 1 1B 1 0C 0 1D 0 0

§ a =#rowsoftypeA,etc.¡ Note: sim(C1,C2)=a/(a+b+c)¡ Then: Pr[h(C1)=h(C2)]=Sim(C1,C2)

§ LookdownthecolsC1 andC2 untilweseea1§ Ifit’satype-A row,thenh(C1)=h(C2)Ifatype-B ortype-C row,thennot

37J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

38

¡ Weknow:Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Nowgeneralizetomultiplehashfunctions

¡ Thesimilarityoftwosignaturesisthefractionofthehashfunctionsinwhichtheyagree

¡ Note: BecauseoftheMin-Hashproperty,thesimilarityofcolumnsisthesameastheexpectedsimilarityoftheirsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Page 20: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

20

39J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Similarities:1-3 2-4 1-2 3-4

Col/Col 0.75 0.75 0 0Sig/Sig 0.67 1.00 0 0

Signature matrix M

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121

0101

0101

1010

1010

1010

1001

0101

Input matrix (Shingles x Documents)

3

4

7

2

6

1

5

Permutation p

¡ PickK=100randompermutationsoftherows¡ Thinkofsig(C) asacolumnvector¡ sig(C)[i]= accordingtothei-th permutation,theindexofthefirstrowthathasa1incolumnCsig(C)[i] = min (pi(C))

¡ Note: Thesketch(signature)ofdocumentC issmall~𝟏𝟎𝟎 bytes!

¡ Weachievedourgoal! We“compressed”longbitvectorsintoshortsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 40

Page 21: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

21

¡ Permutingrowsevenonceisprohibitive¡ Rowhashing!

§ PickK=100 hashfunctionski§ Orderingunderki givesarandomrowpermutation!

¡ One-passimplementation§ ForeachcolumnC andhash-func.ki keepa“slot”forthemin-hashvalue

§ Initializeallsig(C)[i]=¥§ Scanrowslookingfor1s

§ Supposerowj has1incolumnC§ Thenforeachki :

§ Ifki(j)<sig(C)[i],thensig(C)[i]¬ ki(j)J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 41

How to pick a randomhash function h(x)?Universal hashing:ha,b(x)=((a·x+b) mod p) mod Nwhere:a,b … random integersp … prime number (p > N)

¡ Suppose1billionrows.

¡ Hardtopickarandompermutationfrom1…billion.

¡ Representingarandompermutationrequires1billionentries.

¡ Accessingrowsinpermutedorderleadstothrashing.

42

Page 22: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

22

¡ Agoodapproximationtopermutingrows:pick100(?)hashfunctions.

¡ Foreachcolumnc andeachhashfunctionhi,keepa“slot” M(i,c).

¡ Intent:M(i,c)willbecomethesmallestvalueofhi(r )forwhichcolumnc has1inrowr.§ I.e.,hi(r )givesorderofrowsfor i thpermuation.

43

Minhashing

InitializeM(i,c)to∞foralliandcfor eachrowrfor eachcolumnc

if chas1inrowrfor eachhashfunctionhi do

if hi(r)isasmallervaluethanM(i,c)then

M(i,c):=hi(r );

44

Minhashing

Page 23: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

23

45

Row C1 C21 1 02 0 13 1 14 1 05 0 1

h(x) = x mod 5g(x) = 2x+1 mod 5

Sig1 Sig2

Minhashing

h(1) = 1 1 -g(1) = 3 3 -h(2) = 2 1 2g(2) = 0 3 0

h(3) = 3 1 2g(3) = 2 2 0h(4) = 4 1 2g(4) = 4 2 0h(5) = 0 1 0g(5) = 1 2 0

¡ Often,dataisgivenbycolumn,notrow.§ E.g.,columns=documents,rows=shingles.

¡ Ifso,sortmatrixoncesoitisbyrow.

¡ Andalways computehi(r)onlyonceforeachrow.

46

Minhashing

Page 24: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

24

Step3:Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments

Docu-ment

The setof stringsof length kthat appearin the doc-ument

Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity

Locality-SensitiveHashing

Candidatepairs:those pairsof signaturesthat we needto test forsimilarity

¡ Supposewehave,inmainmemory,datarepresentingalargenumberofobjects.§ Maybetheobjectsthemselves.§ Maybesignaturesasinminhashing.

¡ Wewanttocompareeachtoeach,findingthosepairs thataresufficientlysimilar.

48

Page 25: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

25

¡ Whilethesignatures ofallcolumnsmayfitinmainmemory,comparingthesignaturesofallpairs ofcolumnsisquadratic inthenumberofcolumns.

¡ Example:106 columnsimplies5*1011 column-comparisons.

¡ At1microsecond/comparison:6days.

49

¡ Goal:FinddocumentswithJaccard similarityatleasts (forsomesimilaritythreshold,e.g., s=0.8)

¡ LSH– Generalidea: Useafunctionf(x,y) thattellswhetherx andy isacandidatepair: apairofelementswhosesimilaritymustbeevaluated

¡ ForMin-Hashmatrices:§ HashcolumnsofsignaturematrixM tomanybuckets§ Eachpairofdocumentsthathashesintothesamebucketisacandidatepair

50J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

1212

1412

2121

Page 26: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

26

¡ Pickasimilaritythresholds (0<s<1)

¡ Columnsx andy ofM areacandidatepair iftheirsignaturesagreeonatleastfractions oftheirrows:M (i,x)=M (i,y) foratleastfrac.s valuesofi§ Weexpectdocumentsx andy tohavethesame(Jaccard)similarityastheirsignatures

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 51

1212

1412

2121

¡ Bigidea: HashcolumnsofsignaturematrixM severaltimes

¡ Arrangethat(only)similarcolumns arelikelytohashtothesamebucket,withhighprobability

¡ Candidatepairsarethosethathashtothesamebucket

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 52

1212

1412

2121

Page 27: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

27

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 53

Signature matrix M

r rowsper band

b bands

Onesignature

1212

1412

2121

¡ DividematrixM intob bandsofr rows

¡ Foreachband,hashitsportionofeachcolumntoahashtablewithk buckets§ Makek aslargeaspossible

¡ Candidate columnpairsarethosethathashtothesamebucketfor≥ 1 band

¡ Tune b andr tocatchmostsimilarpairs,butfewnon-similarpairs

54J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Page 28: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

28

Matrix M

r rows b bands

BucketsColumns 2 and 6are probably identical (candidate pair)

Columns 6 and 7 aresurely different.

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 55

¡ Thereareenoughbuckets thatcolumnsareunlikelytohashtothesamebucketunlesstheyareidentical inaparticularband

¡ Hereafter,weassumethat“samebucket”means“identicalinthatband”

¡ Assumptionneededonlytosimplifyanalysis,notforcorrectnessofalgorithm

56J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Page 29: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

29

Assumethefollowingcase:¡ Suppose100,000columnsofM (100kdocs)¡ Signaturesof100integers(rows)¡ Therefore,signaturestake40Mb¡ Chooseb =20bandsofr =5integers/band

¡ Goal: Findpairsofdocumentsthatareatleasts =0.8 similar

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 57

1212

1412

2121

¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.8

§ Sincesim(C1,C2)³ s,wewantC1,C2 tobeacandidatepair:Wewantthemtohashtoat least1commonbucket(atleastonebandisidentical)

¡ ProbabilityC1,C2 identicalinoneparticularband:(0.8)5 =0.328

¡ ProbabilityC1,C2 arenot similarinallofthe20bands:(1-0.328)20 =0.00035§ i.e.,about1/3000thofthe80%-similarcolumnpairsarefalsenegatives (wemissthem)

§ Wewouldfind99.965%pairsoftrulysimilardocuments

58J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

1212

1412

2121

Page 30: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

30

¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.3

§ Sincesim(C1,C2)<s wewantC1,C2 tohashtoNOcommonbuckets (allbandsshouldbedifferent)

¡ ProbabilityC1,C2 identicalinoneparticularband:(0.3)5 =0.00243

¡ ProbabilityC1,C2 identicalinatleast1of20bands:1- (1- 0.00243)20 =0.0474§ Inotherwords,approximately4.74%pairsofdocswithsimilarity0.3endupbecomingcandidatepairs§ Theyarefalsepositivessincewewillhavetoexaminethem(theyarecandidatepairs)butthenitwillturnouttheirsimilarityisbelowthresholds

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 59

1212

1412

2121

¡ Pick:§ ThenumberofMin-Hashes(rowsofM)§ Thenumberofbandsb,and§ Thenumberofrowsr perbandtobalancefalsepositives/negatives

¡ Example: Ifwehadonly15bandsof5rows,thenumberoffalsepositiveswouldgodown,butthenumberoffalsenegativeswouldgoup

60J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

1212

1412

2121

Page 31: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

31

Similarity t =sim(C1, C2) of two sets

Probabilityof sharinga bucket

Sim

ilarit

y th

resh

old

sNo chance

if t < s

Probability = 1 if t > s

61J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 62

Remember:Probability ofequal hash-values= similarity

Similarity t =sim(C1, C2) of two sets

Probabilityof sharinga bucket

Page 32: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

32

¡ ColumnsC1 andC2 havesimilarityt¡ Pickanyband(r rows)§ Prob.thatallrowsinbandequal= tr

§ Prob.thatsomerowinbandunequal=1- tr

¡ Prob.thatnobandidentical=(1- tr)b

¡ Prob.thatatleast1bandidentical=1- (1- tr)b

63J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

t r

All rowsof a bandare equal

1 -

Some rowof a bandunequal

( )b

No bandsidentical

1 -

At leastone bandidentical

s ~ (1/b)1/r

64J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

Similarity t=sim(C1, C2) of two sets

Probabilityof sharinga bucket

Page 33: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

33

¡ Similaritythresholds¡ Prob.thatatleast1bandisidentical:

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 65

s 1-(1-sr)b

.2 .006

.3 .047

.4 .186

.5 .470

.6 .802

.7 .975

.8 .9996

¡ Pickingr andb togetthebestS-curve§ 50hash-functions(r=5,b=10)

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 66

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Blue area: False Negative rateGreen area: False Positive rate

Similarity

Prob

. sha

ring

a bu

cket

Page 34: 12 Finding Similar Items - SJTUyshen/courses/BigData/12 Finding... · 2017. 5. 12. · 5/12/17 4 ¡Many problems can be expressed as finding “similar” sets: §Find near-neighbors

5/12/17

34

¡ TuneM,b,r togetalmostallpairswithsimilarsignatures,buteliminatemostpairsthatdonothavesimilarsignatures

¡ Checkinmainmemorythatcandidatepairsreallydohavesimilarsignatures

¡ Optional: Inanotherpassthroughdata,checkthattheremainingcandidatepairsreallyrepresentsimilardocuments

67J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org

¡ Shingling: Convertdocumentstosets§ WeusedhashingtoassigneachshingleanID

¡ Min-Hashing:Convertlargesetstoshortsignatures,whilepreservingsimilarity§ Weusedsimilaritypreservinghashing togeneratesignatureswithpropertyPr[hp(C1)=hp(C2)]=sim(C1,C2)

§ Weusedhashingtogetaroundgeneratingrandompermutations

¡ Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments§ Weusedhashingtofindcandidatepairs ofsimilarity³ s

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 68