12 Finding Similar Items

5/12/17

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
Graph data: PageRank, SimRank; Network Analysis; Spam Detection
Infinite data: Filtering data streams; Web advertising; Queries on streams
Machine learning: Decision Trees; Perceptron, kNN
Apps: Recommender systems; Association Rules; Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

[Hays and Efros, SIGGRAPH 2007]

10 nearest neighbors from a collection of 20,000 images

10 nearest neighbors from a collection of 2 million images

• Many problems can be expressed as finding “similar” sets:
  ◦ Find near-neighbors in high-dimensional space
• Examples:
  ◦ Pages with similar words
    ▪ For duplicate detection, classification by topic
  ◦ Customers who purchased similar products
    ▪ Products with similar customer sets
  ◦ Images with similar features
  ◦ Users who visited similar websites

• Given: High-dimensional data points $x_1, x_2, \ldots$
  ◦ For example: An image is a long vector of pixel colors

    1 2 1
    0 2 1   →   [1 2 1 0 2 1 0 1 0]
    0 1 0

• And some distance function $d(x_1, x_2)$
  ◦ Which quantifies the “distance” between $x_1$ and $x_2$
• Goal: Find all pairs of data points $(x_i, x_j)$ that are within some distance threshold $d(x_i, x_j) \le s$
• Note: Naïve solution would take $O(N^2)$, where $N$ is the number of data points
• MAGIC: This can be done in $O(N)$!! How?

• Last time: Finding frequent pairs
  ◦ Naïve solution: Single pass, but requires space quadratic in the number of items
    ▪ Keeps the count of every pair {i, j} in the data, for items 1…N
  ◦ A-Priori:
    ▪ First pass: Find frequent singletons. For a pair to be a frequent pair candidate, its singletons have to be frequent!
    ▪ Second pass: Count only candidate pairs! This keeps the count of pairs {i, j} only for items 1…K
  ◦ N … number of distinct items
  ◦ K … number of items with support ≥ s

• Last time: Finding frequent pairs
• Further improvement: PCY
  ◦ Pass 1:
    ▪ Count the exact frequency of each item (items 1…N)
    ▪ Take pairs of items {i, j}, hash them into B buckets, and count the number of pairs that hashed to each bucket (buckets 1…B)
    ▪ Example: Basket 1: {1,2,3} gives pairs {1,2}, {1,3}, {2,3}; bucket counts so far: 2, 1
  ◦ Pass 2:
    ▪ For a pair {i, j} to be a candidate for a frequent pair, its singletons {i}, {j} have to be frequent and the pair has to hash to a frequent bucket!
    ▪ Example: Basket 1: {1,2,3} gives pairs {1,2}, {1,3}, {2,3}; Basket 2: {1,2,4} gives pairs {1,2}, {1,4}, {2,4}; bucket counts: 3, 1, 2

Previous lecture: A-Priori
Main idea: Candidates
Instead of keeping a count of each pair, only keep a count of candidate pairs!

Today’s lecture: Find pairs of similar docs
Main idea: Candidates
-- Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket
-- Pass 2: Only compare documents that are candidates (i.e., they hashed to the same bucket)
Benefits: Instead of $O(N^2)$ comparisons, we need $O(N)$ comparisons to find similar documents

• Goal: Find near-neighbors in high-dim. space
  ◦ We formally define “near neighbors” as points that are a “small distance” apart
• For each application, we first need to define what “distance” means
• Today: Jaccard distance/similarity
  ◦ The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: $\mathrm{sim}(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
  ◦ Jaccard distance: $d(C_1, C_2) = 1 - |C_1 \cap C_2| / |C_1 \cup C_2|$
• Example: 3 in intersection, 8 in union
  ◦ Jaccard similarity = 3/8
  ◦ Jaccard distance = 5/8
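The two definitions can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the two example sets are assumptions, chosen so that the counts match the 3/8 example (3 shared elements, 8 in the union).

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = c1 | c2
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(c1 & c2) / len(union)


def jaccard_distance(c1: set, c2: set) -> float:
    """Jaccard distance is 1 minus Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)


# Hypothetical sets with 3 shared elements and 8 elements in the union.
c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(c1, c2))  # 0.375 (= 3/8)
print(jaccard_distance(c1, c2))    # 0.625 (= 5/8)
```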

• Goal: Given a large number ($N$ in the millions or billions) of documents, find “near duplicate” pairs
• Applications:
  ◦ Mirror websites, or approximate mirrors
    ▪ Don’t want to show both in search results
  ◦ Similar news articles at many news sites
    ▪ Cluster articles by “same story”
• Problems:
  ◦ Many small pieces of one document can appear out of order in another
  ◦ Too many documents to compare all pairs
  ◦ Documents are so large or so many that they cannot fit in main memory

1. Shingling: Convert documents to sets
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
  ◦ Candidate pairs!

Document
→ Shingling → The set of strings of length $k$ that appear in the document
→ Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

Step 1: Shingling: Convert documents to sets

• Simple approaches:
  ◦ Document = set of words appearing in document
  ◦ Document = set of “important” words
  ◦ Don’t work well for this application. Why?
• Need to account for ordering of words!
• A different way: Shingles!

• A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc
  ◦ Tokens can be characters, words or something else, depending on the application
  ◦ Assume tokens = characters for examples
• Example: k = 2; document D1 = abcab. Set of 2-shingles: S(D1) = {ab, bc, ca}
  ◦ Option: Shingles as a bag (multiset), count ab twice: S’(D1) = {ab, bc, ca, ab}
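A minimal Python sketch of character shingling, under the slide's assumption that tokens are characters. The function names are illustrative, not from the slides; the set and bag variants reproduce S(D1) and S’(D1) for D1 = abcab.

```python
from collections import Counter


def shingle_set(doc: str, k: int) -> set:
    """Set of all length-k character substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Bag (multiset) variant: repeated shingles are counted."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


print(sorted(shingle_set("abcab", 2)))  # ['ab', 'bc', 'ca']
print(shingle_bag("abcab", 2)["ab"])    # 2
```

A document shorter than k simply yields an empty set here; whether that is the right behaviour depends on the application.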

• To compress long shingles, we can hash them to (say) 4 bytes
• Represent a document by the set of hash values of its k-shingles
  ◦ Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared
• Example: k = 2; document D1 = abcab. Set of 2-shingles: S(D1) = {ab, bc, ca}. Hash the shingles: h(D1) = {1, 5, 7}
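One way to hash shingles to 4 bytes is sketched below. The choice of CRC32 (Python's `zlib.crc32`) is an assumption made for illustration; the slides do not name a hash function, and the values {1, 5, 7} in the example are illustrative and will not be reproduced by a real hash.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Set of 32-bit (4-byte) hash values of the k-shingles of doc."""
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))
        for i in range(len(doc) - k + 1)
    }


h_d1 = hashed_shingles("abcab", 2)
print(len(h_d1))  # 3 distinct hash values, one per distinct shingle
```

Each value fits in an unsigned 32-bit integer, so a long shingle (say k = 10) costs 4 bytes instead of 10 or more.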

• Document D1 is a set of its k-shingles: $C_1 = S(D_1)$
• Equivalently, each document is a 0/1 vector in the space of k-shingles
  ◦ Each unique shingle is a dimension
  ◦ Vectors are very sparse
• A natural similarity measure is the Jaccard similarity: $\mathrm{sim}(D_1, D_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$

• Documents that have lots of shingles in common have similar text, even if the text appears in different order
• Caveat: You must pick k large enough, or most documents will have most shingles
  ◦ k = 5 is OK for short documents
  ◦ k = 10 is better for long documents

• Suppose we need to find near-duplicate documents among $N = 1$ million documents
• Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs
  ◦ $N(N-1)/2 \approx 5 \cdot 10^{11}$ comparisons
  ◦ At $10^5$ secs/day and $10^6$ comparisons/sec, it would take 5 days
• For $N = 10$ million, it takes more than a year…
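The arithmetic behind these estimates can be checked with a short Python sketch. The function name is hypothetical; the default rates are the ones quoted above ($10^6$ comparisons/sec, $10^5$ secs/day).

```python
def naive_comparison_days(n_docs: int,
                          comparisons_per_sec: float = 1e6,
                          secs_per_day: float = 1e5) -> float:
    """Days needed to compare every pair of documents once."""
    pairs = n_docs * (n_docs - 1) // 2
    return pairs / comparisons_per_sec / secs_per_day


print(naive_comparison_days(10**6))  # just under 5 days
print(naive_comparison_days(10**7))  # roughly 500 days
```

Because the number of pairs grows quadratically, multiplying N by 10 multiplies the running time by about 100.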

Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity

• Many similarity problems can be formalized as finding subsets that have significant intersection
• Encode sets using 0/1 (bit, boolean) vectors
  ◦ One dimension per element in the universal set
• Interpret set intersection as bitwise AND, and set union as bitwise OR
• Example: $C_1 = 10111$; $C_2 = 10011$
  ◦ Size of intersection = 3; size of union = 4
  ◦ Jaccard similarity (not distance) = 3/4
  ◦ Distance: $d(C_1, C_2) = 1 - (\text{Jaccard similarity}) = 1/4$
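The bitwise view can be tried directly in Python, where an integer serves as a bit vector. This is a small sketch with a hypothetical function name; it reproduces the example C1 = 10111, C2 = 10011.

```python
def bit_jaccard(c1: int, c2: int) -> float:
    """Jaccard similarity of two sets encoded as bit vectors (Python ints)."""
    intersection = bin(c1 & c2).count("1")  # bitwise AND = set intersection
    union = bin(c1 | c2).count("1")         # bitwise OR = set union
    if union == 0:
        return 1.0  # assumption: two empty sets count as identical
    return intersection / union


c1 = 0b10111
c2 = 0b10011
sim = bit_jaccard(c1, c2)
print(sim, 1 - sim)  # 0.75 0.25
```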

• Rows = elements (shingles)
• Columns = sets (documents)
  ◦ 1 in row e and column s if and only if e is a member of s
  ◦ Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
  ◦ Typical matrix is sparse!
• Each document is a column:
  ◦ Example: $\mathrm{sim}(C_1, C_2) = ?$
    ▪ Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6
    ▪ $d(C_1, C_2) = 1 - (\text{Jaccard similarity}) = 3/6$

(Figure: boolean matrix with shingles as rows and documents as columns)

• We might not really represent the data by a boolean matrix.
• Sparse matrices are usually better represented by the list of places where there is a non-zero value.
• But the matrix picture is conceptually useful.

1. When the sets are so large or so many that they cannot fit in main memory.
2. Or, when there are so many sets that comparing all pairs of sets takes too much time.
3. Or both.

• So far:
  ◦ Documents → Sets of shingles
  ◦ Represent sets as boolean vectors in a matrix
• Next goal: Find similar columns while computing small signatures
  ◦ Similarity of columns == similarity of signatures

• Next Goal: Find similar columns, Small signatures
• Naïve approach:
  ◦ 1) Signatures of columns: small summaries of columns
  ◦ 2) Examine pairs of signatures to find similar columns
    ▪ Essential: Similarities of signatures and columns are related
  ◦ 3) Optional: Check that columns with similar signatures are really similar
• Warnings:
  ◦ Comparing all pairs may take too much time: Job for LSH
    ▪ These methods can produce false negatives, and even false positives (if the optional check is not made)

• Key idea: “hash” each column $C$ to a small signature $h(C)$, such that:
  ◦ (1) $h(C)$ is small enough that the signature fits in RAM
  ◦ (2) $\mathrm{sim}(C_1, C_2)$ is the same as the “similarity” of signatures $h(C_1)$ and $h(C_2)$
• Goal: Find a hash function $h(\cdot)$ such that:
  ◦ If $\mathrm{sim}(C_1, C_2)$ is high, then with high prob. $h(C_1) = h(C_2)$
  ◦ If $\mathrm{sim}(C_1, C_2)$ is low, then with high prob. $h(C_1) \ne h(C_2)$
• Hash docs into buckets. Expect that “most” pairs of near duplicate docs hash into the same bucket!

• Goal: Find a hash function $h(\cdot)$ such that:
  ◦ If $\mathrm{sim}(C_1, C_2)$ is high, then with high prob. $h(C_1) = h(C_2)$
  ◦ If $\mathrm{sim}(C_1, C_2)$ is low, then with high prob. $h(C_1) \ne h(C_2)$
• Clearly, the hash function depends on the similarity metric:
  ◦ Not all similarity metrics have a suitable hash function
• There is a suitable hash function for the Jaccard similarity: It is called Min-Hashing

• Imagine the rows of the boolean matrix permuted under random permutation $\pi$
• Define a “hash” function $h_\pi(C)$ = the index of the first (in the permuted order $\pi$) row in which column $C$ has value 1: $h_\pi(C) = \min_\pi \pi(C)$
• Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
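A minimal Python sketch of min-hashing with explicit permutations. The names are illustrative assumptions, rows are numbered from 0, and a column is represented as the set of row indexes where it holds a 1; each permutation is stored as a list mapping a row to its position in the permuted order.

```python
import random


def random_permutations(n_rows: int, k: int, seed: int = 0) -> list:
    """k random permutations; perms[i][row] is the new position of row."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        positions = list(range(n_rows))
        rng.shuffle(positions)
        perms.append(positions)
    return perms


def minhash_signature(column: set, perms: list) -> list:
    """column is the set of row indexes with value 1 (must be non-empty)."""
    return [min(perm[row] for row in column) for perm in perms]


# Two hand-written permutations of 6 rows: identity and reversal.
perms = [[0, 1, 2, 3, 4, 5], [5, 4, 3, 2, 1, 0]]
print(minhash_signature({2, 5}, perms))  # [2, 0]
```

Storing whole permutations like this is only workable for small matrices; it is meant to make the definition concrete, not to scale.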

(Figure: min-hashing example showing the permutation $\pi$, the input matrix (Shingles x Documents) and the signature matrix $M$. Annotations: “2nd element of the permutation is the first to map to a 1”; “4th element of the permutation is the first to map to a 1”.)

Note: Another (equivalent) way is to store row indexes:

1 5 1 5
2 3 1 3
6 4 6 4

• Choose a random permutation $\pi$
• Claim: $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$
• Why?
  ◦ Let $X$ be a doc (set of shingles), $y \in X$ is a shingle
  ◦ Then: $\Pr[\pi(y) = \min(\pi(X))] = 1/|X|$
    ▪ It is equally likely that any $y \in X$ is mapped to the min element
  ◦ Let $y$ be s.t. $\pi(y) = \min(\pi(C_1 \cup C_2))$
  ◦ Then either: $\pi(y) = \min(\pi(C_1))$ if $y \in C_1$, or $\pi(y) = \min(\pi(C_2))$ if $y \in C_2$
    ▪ One of the two cols had to have 1 at position $y$
  ◦ So the prob. that both are true is the prob. $y \in C_1 \cap C_2$
  ◦ $\Pr[\min(\pi(C_1)) = \min(\pi(C_2))] = |C_1 \cap C_2| / |C_1 \cup C_2| = \mathrm{sim}(C_1, C_2)$

• Given cols $C_1$ and $C_2$, rows may be classified as:

       C1  C2
  A    1   1
  B    1   0
  C    0   1
  D    0   0

  ◦ a = # rows of type A, etc.
• Note: $\mathrm{sim}(C_1, C_2) = a/(a + b + c)$
• Then: $\Pr[h(C_1) = h(C_2)] = \mathrm{sim}(C_1, C_2)$
  ◦ Look down the cols $C_1$ and $C_2$ until we see a 1
  ◦ If it’s a type-A row, then $h(C_1) = h(C_2)$. If a type-B or type-C row, then not

• We know: $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$
• Now generalize to multiple hash functions
• The similarity of two signatures is the fraction of the hash functions in which they agree
• Note: Because of the Min-Hash property, the similarity of columns is the same as the expected similarity of their signatures
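The claim that signature similarity estimates column similarity can be checked empirically. The sketch below is an illustration under stated assumptions: the function names, the two example sets, and the choice of 1000 seeded random permutations are all hypothetical, picked so that the true Jaccard similarity is exactly 1/3.

```python
import random


def signature_similarity(sig1: list, sig2: list) -> float:
    """Fraction of min-hash positions in which two signatures agree."""
    agree = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return agree / len(sig1)


def estimate_similarity(c1: set, c2: set, n_rows: int, k: int,
                        seed: int = 0) -> float:
    """Estimate Jaccard similarity from k random row permutations."""
    rng = random.Random(seed)
    sig1, sig2 = [], []
    for _ in range(k):
        perm = list(range(n_rows))
        rng.shuffle(perm)
        sig1.append(min(perm[r] for r in c1))
        sig2.append(min(perm[r] for r in c2))
    return signature_similarity(sig1, sig2)


c1 = set(range(0, 60))
c2 = set(range(30, 90))  # 30 shared rows, 90 in the union: similarity 1/3
print(estimate_similarity(c1, c2, n_rows=100, k=1000))  # close to 0.333
```

With k hash functions the estimate has standard deviation $\sqrt{t(1-t)/k}$ for true similarity $t$, so longer signatures give tighter estimates.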

(Figure: permutation $\pi$, input matrix (Shingles x Documents), signature matrix $M$)

Similarities:

           1-3    2-4    1-2    3-4
Col/Col    0.75   0.75   0      0
Sig/Sig    0.67   1.00   0      0

• Pick K = 100 random permutations of the rows
• Think of sig(C) as a column vector
• sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C: $\mathrm{sig}(C)[i] = \min(\pi_i(C))$
• Note: The sketch (signature) of document C is small, ~100 bytes!
• We achieved our goal! We “compressed” long bit vectors into short signatures

• Permuting rows even once is prohibitive
• Row hashing!
  ◦ Pick K = 100 hash functions $k_i$
  ◦ Ordering under $k_i$ gives a random row permutation!
• One-pass implementation
  ◦ For each column $C$ and hash-func. $k_i$ keep a “slot” for the min-hash value
  ◦ Initialize all $\mathrm{sig}(C)[i] = \infty$
  ◦ Scan rows looking for 1s
    ▪ Suppose row $j$ has 1 in column $C$
    ▪ Then for each $k_i$: if $k_i(j) < \mathrm{sig}(C)[i]$, then $\mathrm{sig}(C)[i] \leftarrow k_i(j)$

How to pick a random hash function h(x)?
Universal hashing: $h_{a,b}(x) = ((a \cdot x + b) \bmod p) \bmod N$
where:
a, b … random integers
p … prime number (p > N)
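A sketch of drawing random hash functions from the universal family above. Two choices here are assumptions rather than part of the slide: the prime is fixed to the Mersenne prime $2^{61} - 1$ (any prime p > N would do), and a is drawn non-zero, a common extra condition.

```python
import random

MERSENNE_PRIME = 2**61 - 1  # assumption: any prime p > N would do


def make_universal_hash(n_buckets: int, rng: random.Random):
    """Return h(x) = ((a*x + b) mod p) mod N with random a, b."""
    a = rng.randrange(1, MERSENNE_PRIME)  # a != 0, a common extra condition
    b = rng.randrange(0, MERSENNE_PRIME)

    def h(x: int) -> int:
        return ((a * x + b) % MERSENNE_PRIME) % n_buckets

    return h


rng = random.Random(42)
hash_funcs = [make_universal_hash(1000, rng) for _ in range(100)]
print(len(hash_funcs), hash_funcs[0](12345) < 1000)  # 100 True
```

Each function needs only two integers of storage, which is what makes row hashing practical where explicit permutations are not.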

• Suppose 1 billion rows.
• Hard to pick a random permutation from 1…billion.
• Representing a random permutation requires 1 billion entries.
• Accessing rows in permuted order leads to thrashing.

• A good approximation to permuting rows: pick 100 (?) hash functions.
• For each column $c$ and each hash function $h_i$, keep a “slot” $M(i, c)$.
• Intent: $M(i, c)$ will become the smallest value of $h_i(r)$ for which column $c$ has 1 in row $r$.
  ◦ I.e., $h_i(r)$ gives order of rows for the $i$-th permutation.

Initialize M(i, c) to ∞ for all i and c
for each row r
    for each column c
        if c has 1 in row r
            for each hash function h_i do
                if h_i(r) is a smaller value than M(i, c) then
                    M(i, c) := h_i(r);

Row  C1  C2
1    1   0
2    0   1
3    1   1
4    1   0
5    0   1

h(x) = x mod 5
g(x) = (2x + 1) mod 5

             Sig1  Sig2
h(1) = 1     1     -
g(1) = 3     3     -
h(2) = 2     1     2
g(2) = 0     3     0
h(3) = 3     1     2
g(3) = 2     2     0
h(4) = 4     1     2
g(4) = 4     2     0
h(5) = 0     1     0
g(5) = 1     2     0
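The worked example can be replayed with a short Python sketch of the one-pass algorithm. The function name and the data layout (a list of `(row number, bits)` pairs) are assumptions for illustration; the hash functions h and g and the matrix are the ones from the example.

```python
import math


def minhash_one_pass(rows, hash_funcs, n_cols):
    """rows: iterable of (row_number, bits); returns M as a list of lists."""
    sig = [[math.inf] * n_cols for _ in hash_funcs]
    for r, bits in rows:
        hashes = [h(r) for h in hash_funcs]  # compute each h_i(r) once per row
        for c, bit in enumerate(bits):
            if bit == 1:
                for i, value in enumerate(hashes):
                    if value < sig[i][c]:
                        sig[i][c] = value
    return sig


rows = [(1, (1, 0)), (2, (0, 1)), (3, (1, 1)), (4, (1, 0)), (5, (0, 1))]


def h(x):
    return x % 5


def g(x):
    return (2 * x + 1) % 5


sig = minhash_one_pass(rows, [h, g], n_cols=2)
print(sig)  # [[1, 0], [2, 0]]
```

Reading the result column by column gives Sig1 = (1, 2) and Sig2 = (0, 0), the final row of the trace.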

• Often, data is given by column, not row.
  ◦ E.g., columns = documents, rows = shingles.
• If so, sort matrix once so it is by row.
• And always compute $h_i(r)$ only once for each row.

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

• Suppose we have, in main memory, data representing a large number of objects.
  ◦ Maybe the objects themselves.
  ◦ Maybe signatures as in minhashing.
• We want to compare each to each, finding those pairs that are sufficiently similar.

• While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
• Example: $10^6$ columns implies $5 \cdot 10^{11}$ column-comparisons.
• At 1 microsecond/comparison: 6 days.

• Goal: Find documents with Jaccard similarity at least $s$ (for some similarity threshold, e.g., $s = 0.8$)
• LSH, general idea: Use a function $f(x, y)$ that tells whether $x$ and $y$ are a candidate pair: a pair of elements whose similarity must be evaluated
• For Min-Hash matrices:
  ◦ Hash columns of signature matrix $M$ to many buckets
  ◦ Each pair of documents that hashes into the same bucket is a candidate pair

• Pick a similarity threshold $s$ ($0 < s < 1$)
• Columns $x$ and $y$ of $M$ are a candidate pair if their signatures agree on at least fraction $s$ of their rows: $M(i, x) = M(i, y)$ for at least frac. $s$ values of $i$
  ◦ We expect documents $x$ and $y$ to have the same (Jaccard) similarity as their signatures

• Big idea: Hash columns of signature matrix $M$ several times
• Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
• Candidate pairs are those that hash to the same bucket

(Figure: signature matrix $M$ divided into $b$ bands of $r$ rows per band; each column is one signature)

• Divide matrix $M$ into $b$ bands of $r$ rows
• For each band, hash its portion of each column to a hash table with $k$ buckets
  ◦ Make $k$ as large as possible
• Candidate column pairs are those that hash to the same bucket for ≥ 1 band
• Tune $b$ and $r$ to catch most similar pairs, but few non-similar pairs
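The banding step might be sketched as follows. This is an illustration under assumptions: signatures are held in a dict from a document id to a list of b·r min-hash values, and a Python dict keyed by the band's tuple stands in for a hash table with very many buckets, so two columns share a bucket exactly when they are identical in that band.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict, b: int, r: int) -> set:
    """Pairs of doc ids whose signatures are identical in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's portion
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


# Hypothetical signatures of length 4, split into b = 2 bands of r = 2 rows.
signatures = {
    "A": [1, 2, 3, 4],
    "B": [1, 2, 9, 9],
    "C": [7, 7, 7, 7],
}
print(lsh_candidate_pairs(signatures, b=2, r=2))  # {('A', 'B')}
```

A and B agree in the first band only, which is enough to make them a candidate pair; C shares no band with either and is never compared.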

(Figure: one band of matrix $M$ ($r$ rows, $b$ bands) hashed to buckets. Columns 2 and 6 are probably identical (candidate pair). Columns 6 and 7 are surely different.)

• There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
• Hereafter, we assume that “same bucket” means “identical in that band”
• Assumption needed only to simplify analysis, not for correctness of algorithm

Assume the following case:
• Suppose 100,000 columns of $M$ (100k docs)
• Signatures of 100 integers (rows)
• Therefore, signatures take 40 MB (at 4 bytes per integer)
• Choose $b = 20$ bands of $r = 5$ integers/band
• Goal: Find pairs of documents that are at least $s = 0.8$ similar

• Find pairs of ≥ $s = 0.8$ similarity, set $b = 20$, $r = 5$
• Assume: $\mathrm{sim}(C_1, C_2) = 0.8$
  ◦ Since $\mathrm{sim}(C_1, C_2) \ge s$, we want $C_1$, $C_2$ to be a candidate pair: We want them to hash to at least 1 common bucket (at least one band is identical)
• Probability $C_1$, $C_2$ identical in one particular band: $(0.8)^5 = 0.328$
• Probability $C_1$, $C_2$ are identical in none of the 20 bands: $(1 - 0.328)^{20} = 0.00035$
  ◦ I.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
  ◦ We would find 99.965% of the pairs of truly similar documents

• Find pairs of ≥ $s = 0.8$ similarity, set $b = 20$, $r = 5$
• Assume: $\mathrm{sim}(C_1, C_2) = 0.3$
  ◦ Since $\mathrm{sim}(C_1, C_2) < s$, we want $C_1$, $C_2$ to hash to NO common buckets (all bands should be different)
• Probability $C_1$, $C_2$ identical in one particular band: $(0.3)^5 = 0.00243$
• Probability $C_1$, $C_2$ identical in at least 1 of 20 bands: $1 - (1 - 0.00243)^{20} \approx 0.0475$
  ◦ In other words, approximately 4.75% of the pairs of docs with similarity 0.3 end up becoming candidate pairs
    ▪ They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold $s$
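Both calculations (similarity 0.8 and similarity 0.3, with b = 20 and r = 5) follow from one formula, sketched here in Python. The function name is hypothetical; the exact values are about 0.000356 and 0.0475, which agree with the figures on the slides up to rounding.

```python
def prob_candidate(t: float, r: int, b: int) -> float:
    """Probability that two columns of similarity t share >= 1 band."""
    return 1.0 - (1.0 - t**r) ** b


r, b = 5, 20
false_negative_rate = 1.0 - prob_candidate(0.8, r, b)  # 80%-similar pair missed
false_positive_rate = prob_candidate(0.3, r, b)        # 30%-similar pair kept
print(f"{false_negative_rate:.5f}")  # 0.00036
print(f"{false_positive_rate:.4f}")  # 0.0475
```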

• Pick:
  ◦ The number of Min-Hashes (rows of $M$)
  ◦ The number of bands $b$, and
  ◦ The number of rows $r$ per band
  to balance false positives/negatives
• Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up

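(Figure: probability of sharing a bucket plotted against similarity $t = \mathrm{sim}(C_1, C_2)$ of two sets, as a step at the similarity threshold $s$: no chance if $t < s$, probability = 1 if $t > s$.)

(Figure: probability of sharing a bucket plotted against similarity $t = \mathrm{sim}(C_1, C_2)$ of two sets. Remember: probability of equal hash-values = similarity.)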

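• Columns $C_1$ and $C_2$ have similarity $t$
• Pick any band ($r$ rows)
  ◦ Prob. that all rows in band equal = $t^r$
  ◦ Prob. that some row in band unequal = $1 - t^r$
• Prob. that no band identical = $(1 - t^r)^b$
• Prob. that at least 1 band identical = $1 - (1 - t^r)^b$

(Figure: S-curve of the probability of sharing a bucket against similarity $t = \mathrm{sim}(C_1, C_2)$ of two sets, built up from $t^r$ (all rows of a band are equal), $1 - t^r$ (some row of a band unequal), $(1 - t^r)^b$ (no bands identical) and $1 - (1 - t^r)^b$ (at least one band identical). The threshold is at $s \sim (1/b)^{1/r}$.)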

• Similarity threshold $s$
• Prob. that at least 1 band is identical (values for $b = 20$, $r = 5$):

s     $1 - (1 - s^r)^b$
.2    .006
.3    .047
.4    .186
.5    .470
.6    .802
.7    .975
.8    .9996
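The table can be regenerated with a few lines of Python. The sketch assumes r = 5 and b = 20, the setting of the running example, which reproduces the listed values; the function name is hypothetical. It also evaluates the rule of thumb for where the S-curve rises, $(1/b)^{1/r}$.

```python
def prob_candidate(s: float, r: int, b: int) -> float:
    """Probability that at least one of b bands of r rows is identical."""
    return 1.0 - (1.0 - s**r) ** b


r, b = 5, 20  # assumed setting: 20 bands of 5 rows
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {prob_candidate(s, r, b):.4f}")

threshold = (1 / b) ** (1 / r)  # approximate location of the steep rise
print(f"{threshold:.3f}")  # 0.549
```

Changing r and b moves the steep part of the curve: more rows per band push it to the right, more bands push it to the left.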

• Picking $r$ and $b$ to get the best S-curve
  ◦ 50 hash-functions ($r = 5$, $b = 10$)

(Figure: S-curve over similarity from 0 to 1. Blue area: false negative rate. Green area: false positive rate.)

• Tune $M$, $b$, $r$ to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
• Check in main memory that candidate pairs really do have similar signatures
• Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents

• Shingling: Convert documents to sets
  ◦ We used hashing to assign each shingle an ID
• Min-Hashing: Convert large sets to short signatures, while preserving similarity
  ◦ We used similarity preserving hashing to generate signatures with property $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$
  ◦ We used hashing to get around generating random permutations
• Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
  ◦ We used hashing to find candidate pairs of similarity ≥ $s$

