A GENTLE INTRODUCTION TO APACHE SPARK AND LOCALITY-SENSITIVE HASHING

A Gentle Introduction to Locality Sensitive Hashing with Apache Spark

FRANCOIS GARILLOT

(FORMERLY) TYPESAFE

[email protected]

@huitseeker

LOCALITY-SENSITIVE HASHING

▸ A story : Why LSH

▸ How it works & hash families

▸ LSH distribution

▸ Beware : WIP

SPARK TENETS

▸ broadcast variables

▸ per-partition commands

▸ shuffle sparsely

SEGMENTATION

▸ small sample : 289421 users

▸ larger sample : 5684403 users

46K websites, ultimately users

4 personal laptops, 4 provided laptops

K-MEANS COMPLEXITY

Find $k$ with the 'elbow method' on the within-cluster sum of squares. Then

EM - GAUSSIAN MIXTURE

With dimensions, mixtures,

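LOCALITY-SENSITIVE HASHING FUNCTIONS

A family H of hashing functions is $(d_1, d_2, p_1, p_2)$-sensitive if, for any points $x$ and $y$:

▸ if $d(x, y) \le d_1$ then $P_{h \in H}[h(x) = h(y)] \ge p_1$

▸ if $d(x, y) \ge d_2$ then $P_{h \in H}[h(x) = h(y)] \le p_2$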

DISTANCES ! (THOSE AND MANY OTHERS)

▸ Hamming distance : $h_i(x) = x_i$, where $i$ is a randomly chosen index

▸ Jaccard :

▸ Cosine distance:
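The formulas that went with these bullets did not survive extraction. As a rough sketch, here are the textbook hash families usually paired with each distance: bit sampling for Hamming, MinHash for Jaccard, and random hyperplanes for cosine. The object and function names are made up for this illustration, and the MinHash permutation is approximated by a random affine map modulo a prime, which is an assumption of the sketch rather than something the slides state.

```scala
import scala.util.Random

object HashFamiliesSketch {
  // Hamming distance: bit sampling, h(x) = x(i) for a randomly chosen index i.
  def bitSampling(dim: Int, rng: Random): Vector[Boolean] => Boolean = {
    val i = rng.nextInt(dim)
    x => x(i)
  }

  // Jaccard distance: MinHash over non-empty sets. The random permutation is
  // approximated by a random affine map modulo a prime (assumption).
  def minHash(rng: Random): Set[Int] => Long = {
    val prime = 2147483647L                     // 2^31 - 1
    val a = 1L + rng.nextInt(Int.MaxValue - 1)  // in [1, prime - 1]
    val b = rng.nextInt(Int.MaxValue).toLong    // in [0, prime - 1]
    s => s.iterator.map(e => java.lang.Math.floorMod(a * e + b, prime)).min
  }

  // Cosine distance: random hyperplane, h(x) = sign(r . x) with Gaussian r.
  def randomHyperplane(dim: Int, rng: Random): Vector[Double] => Boolean = {
    val r = Vector.fill(dim)(rng.nextGaussian())
    x => x.zip(r).map { case (xi, ri) => xi * ri }.sum >= 0.0
  }
}
```

Each function draws its randomness once and returns a closure, so the same drawn hash can be applied to every record.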

EARTH MOVER'S DISTANCE

Find optimal F minimizing:

Then:
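The two formulas on this slide were lost in extraction. For reference, the usual definition of the Earth Mover's Distance between two weighted point sets $P = \{(p_i, w_{p_i})\}$ and $Q = \{(q_j, w_{q_j})\}$ with ground distance $d_{ij}$ is given below. The notation is the common textbook one and may not match the slide's symbols exactly.

```latex
\min_{F = [f_{ij}]} \; \sum_{i} \sum_{j} f_{ij} \, d_{ij}
\quad \text{subject to} \quad
f_{ij} \ge 0, \;
\sum_{j} f_{ij} \le w_{p_i}, \;
\sum_{i} f_{ij} \le w_{q_j}, \;
\sum_{i} \sum_{j} f_{ij} = \min\Big(\sum_{i} w_{p_i}, \sum_{j} w_{q_j}\Big)

\mathrm{EMD}(P, Q) = \frac{\sum_{i} \sum_{j} f_{ij} \, d_{ij}}{\sum_{i} \sum_{j} f_{ij}}
```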

A WORD ON MODULARITY

LSH for EMD introduced by Charikar in the Simhash paper (2002).

Yet there is no place to plug in your own LSH family in existing implementations (e.g. scikit, mrsqueeze) !
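One way to picture the missing plug-in point is a small trait that the rest of the pipeline depends on. This is only a sketch: `HashFamily`, `BitSampling` and `signature` are hypothetical names invented here, not the API of any library mentioned on the slide.

```scala
import scala.util.Random

object PluggableLshSketch {
  // A hash family is anything that can draw a hash function at random.
  trait HashFamily[T] {
    def draw(rng: Random): T => Int
  }

  // Example plug-in: bit sampling for Hamming distance on bit vectors.
  final class BitSampling(dim: Int) extends HashFamily[Vector[Boolean]] {
    def draw(rng: Random): Vector[Boolean] => Int = {
      val i = rng.nextInt(dim)
      x => if (x(i)) 1 else 0
    }
  }

  // Signature generation depends only on the trait, not on the distance.
  def signature[T](family: HashFamily[T], length: Int, seed: Long): T => Vector[Int] = {
    val rng = new Random(seed)
    val hs = Vector.fill(length)(family.draw(rng))
    x => hs.map(h => h(x))
  }
}
```

Supporting another distance would then mean writing one more `HashFamily` instance, with banding and lookup left untouched.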

LSH AMPLIFICATION : CONCATENATIONS AND PARALLEL

▸ basic LSH:

▸ AND (series) construction:

▸ OR (parallel) construction :
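The amplification formulas were lost from this slide. In the standard analysis, if a single hash collides with probability p, an AND of r hashes collides with probability p^r and an OR of b hashes with probability 1 - (1 - p)^b. The sketch below computes these; the function names are illustrative.

```scala
object AmplificationSketch {
  // AND (series) construction: all r concatenated hashes must agree.
  def and(p: Double, r: Int): Double = math.pow(p, r)

  // OR (parallel) construction: at least one of b hashes must agree.
  def or(p: Double, b: Int): Double = 1.0 - math.pow(1.0 - p, b)

  // AND-then-OR: b parallel bands of r concatenated hashes each.
  def andThenOr(p: Double, r: Int, b: Int): Double = or(and(p, r), b)
}
```

Composing the two steepens the collision-probability curve: pairs with high p keep colliding, while pairs with low p almost never do.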

BASIC LSH

    // Hash every record, keyed by its id.
    val hashCollection = records.map(s => (getId(s), s)).
      mapValues(s => getHash(s, hashers))
    // Cut each hash into bands, keyed by band index.
    val subArray = hashCollection.flatMap { case (recordId, hash) =>
      hash.grouped(hashLength / numberBands).zipWithIndex.map {
        case (band, bandIndex) => (bandIndex, (band, recordId))
      }
    }
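To see what the banding step buys, here is a small sketch of the same idea on plain Scala collections, with no Spark involved. `candidatePairs` is a hypothetical helper, and the sketch assumes every signature has the same length, divisible by `numberBands`.

```scala
object BandingSketch {
  // Split each signature into bands; records sharing a whole band
  // (same band index, same band content) become a candidate pair.
  def candidatePairs(signatures: Map[String, Vector[Int]],
                     numberBands: Int): Set[(String, String)] = {
    val buckets = signatures.toSeq.flatMap { case (id, hash) =>
      hash.grouped(hash.length / numberBands).zipWithIndex.map {
        case (band, bandIndex) => ((bandIndex, band), id)
      }
    }.groupBy(_._1).values.map(_.map(_._2).sorted)
    // Every pair of ids within one bucket is a candidate.
    buckets.flatMap(ids => ids.combinations(2).map(p => (p(0), p(1)))).toSet
  }
}
```

Only records that agree on at least one full band are compared afterwards, which is the OR-of-ANDs construction in action.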

LOOKUP

    def findCandidates(record: Iterable[String], hashers: Array[Int => Int], mBands: BandType) = {
      val hash = getHash(record, hashers)
      val subArrays = partitionArray(hash).zipWithIndex

      subArrays.flatMap { case (band, bandIndex) =>
        val hashedBucket = mBands.lookup(bandIndex).
          headOption.
          flatMap { _.get(band) }
        hashedBucket
      }.flatten.toSet
    }

getHash(record,hashers)

DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS

    records.mapPartitions { iter =>
      // One generator per partition rather than one per record.
      val rng = new scala.util.Random()
      iter.map(x => hashers.flatMap { h => getHashFunction(rng, h)(x) })
    }
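The point of shipping seeds is that a seeded generator rebuilds the same hash function anywhere. A minimal sketch, assuming each hash is a random affine map modulo a prime (a stand-in for a permutation) and that records are non-empty; `hashFunction` and `minHashSignature` are hypothetical names, not the talk's `getHashFunction`:

```scala
import scala.util.Random

object SeededHashersSketch {
  private val prime = 2147483647L // 2^31 - 1

  // Rebuild a hash function deterministically from its seed: the same seed
  // yields the same (a, b) pair, hence the same function, on every worker.
  def hashFunction(seed: Long): Int => Int = {
    val rng = new Random(seed)
    val a = 1L + rng.nextInt(Int.MaxValue - 1)
    val b = rng.nextInt(Int.MaxValue).toLong
    x => java.lang.Math.floorMod(a * x + b, prime).toInt
  }

  // Only the seeds need to travel; the functions are rebuilt locally.
  def minHashSignature(record: Seq[Int], seeds: Seq[Long]): Seq[Int] = {
    val hashers = seeds.map(hashFunction)
    hashers.map(h => record.map(h).min)
  }
}
```

A list of `Long` seeds is small and trivially serializable, whereas closures or explicit permutation tables are neither.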

AND YET, OOM

BASIC LSH

WITH A 2-STABLE GAUSSIAN DISTRIBUTION

With data points, choose and , to solve the problem
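For readers unfamiliar with the 2-stable scheme, the usual hash projects a point onto a random Gaussian direction, shifts it, and cuts the line into segments of width w: h(v) = floor((a . v + b) / w). The sketch below assumes that standard form; the parameter names `w`, `k` and `seed` are illustrative and are not the quantities elided on the slide.

```scala
import scala.util.Random

object TwoStableSketch {
  // One hash h(v) = floor((a . v + b) / w), with a ~ N(0, I) and b ~ U[0, w).
  def hashFunction(dim: Int, w: Double, rng: Random): Vector[Double] => Int = {
    val a = Vector.fill(dim)(rng.nextGaussian())
    val b = rng.nextDouble() * w
    v => math.floor((v.zip(a).map { case (vi, ai) => vi * ai }.sum + b) / w).toInt
  }

  // Concatenate k such hashes (the AND construction) into one bucket key.
  def bucketKey(dim: Int, w: Double, k: Int, seed: Long): Vector[Double] => Vector[Int] = {
    val rng = new Random(seed)
    val hs = Vector.fill(k)(hashFunction(dim, w, rng))
    v => hs.map(h => h(v))
  }
}
```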

WEB LOGS ARE SPARSE

Input : hits per user, over 6 months, 2x50-ish integers/user (4GB)

Output of length 1000 integers per user : 10 (parallel) bands, 100 (concatenated) hashes

64-bit integers : 40 GB

Yet !
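The sizes quoted on this slide can be checked with a little arithmetic. The sketch assumes the figures refer to the larger sample of 5684403 users from the SEGMENTATION slide and that input integers are also stored on 64 bits; neither is stated on this slide.

```scala
object FootprintSketch {
  val users = 5684403L        // assumption: the larger sample
  val inputIntsPerUser = 100  // "2x50-ish integers/user"
  val bands = 10              // parallel bands
  val hashesPerBand = 100     // concatenated hashes per band
  val bytesPerInt = 8         // 64-bit integers

  val hashesPerUser: Long = bands.toLong * hashesPerBand
  val inputBytes: Long = users * inputIntsPerUser * bytesPerInt
  val outputBytes: Long = users * hashesPerUser * bytesPerInt

  val inputGB: Double = inputBytes / 1e9   // about 4.5
  val outputGB: Double = outputBytes / 1e9 // about 45.5
}
```

Under these assumptions the hashes weigh ten times as much as the sparse input, in the same ballpark as the 4 GB and 40 GB on the slide.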

ENTROPY LSH (PANIGRAHY 2006)

REPLACE TABLES BY OFFSETS

Offsets $q + \delta_i$, chosen randomly from the surface of $B(q, r)$, the sphere of radius $r$ centered at $q$
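Drawing such offsets is simple in practice. A minimal sketch, assuming the offsets are meant to be uniform on the surface of a sphere around the query point: normalize a Gaussian vector, scale it to the radius, and translate it by the query. `offset` and `offsets` are hypothetical helper names.

```scala
import scala.util.Random

object EntropyOffsetsSketch {
  // Sample one offset point uniformly from the surface of the sphere of
  // radius r centered at q: normalize a Gaussian vector, scale, translate.
  def offset(q: Vector[Double], r: Double, rng: Random): Vector[Double] = {
    val g = Vector.fill(q.length)(rng.nextGaussian())
    val norm = math.sqrt(g.map(x => x * x).sum)
    q.zip(g).map { case (qi, gi) => qi + r * gi / norm }
  }

  // Draw `count` offsets from a seeded generator so they can be reproduced.
  def offsets(q: Vector[Double], r: Double, count: Int, seed: Long): Vector[Vector[Double]] = {
    val rng = new Random(seed)
    Vector.fill(count)(offset(q, r, rng))
  }
}
```

Each offset is then hashed with the same table, so several buckets of one table are probed instead of one bucket in each of many tables.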

ENTROPY LSH

WITH A 2-STABLE GAUSSIAN DISTRIBUTION

With data points, choose and , to solve the problem with as few as hash tables

BUT ... NETWORK COSTS

▸ Basic LSH : look up buckets,

▸ Entropy LSH : search for offsets

LAYERED LSH (BAHMANI ET AL. 2012)

Output of your LSH family is in , with e.g. a cosine norm.

For closer points, the chance of their hashes hashing to the same bucket is high!

Have an LSH family for your norm on

Likely that for all offsets

Output of hash generation is (GH(p), (H(p), p)) for all p.

In Spark, group by GH(p), or use a custom partitioner for the (H(p), p) RDD.

Network cost :
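The layering can be sketched on plain collections: hash each point with H, hash H(p) again with a coarser G, and group by G(H(p)), which stands in for the machine a point would be sent to. The sketch assumes 2-stable Gaussian hashes at both levels and uses made-up parameter names (`w`, `wOuter`, `k`). In Spark the final grouping would be a `groupByKey` or a custom `Partitioner` instead of `groupBy`.

```scala
import scala.util.Random

object LayeredLshSketch {
  type Point = Vector[Double]

  // A 2-stable (Gaussian) hash: floor((a . v + b) / w).
  private def gaussianHash(dim: Int, w: Double, rng: Random): Point => Int = {
    val a = Vector.fill(dim)(rng.nextGaussian())
    val b = rng.nextDouble() * w
    v => math.floor((v.zip(a).map { case (vi, ai) => vi * ai }.sum + b) / w).toInt
  }

  // Emit (GH(p), (H(p), p)) for every p, then group by GH(p).
  def layer(points: Seq[Point], dim: Int, k: Int, w: Double, wOuter: Double,
            seed: Long): Map[Int, Seq[(Vector[Int], Point)]] = {
    val rng = new Random(seed)
    val hs = Vector.fill(k)(gaussianHash(dim, w, rng)) // H: point -> k integers
    val g = gaussianHash(k, wOuter, rng)               // G: hashes H(p) again
    val keyed = points.map { p =>
      val hp = hs.map(h => h(p))
      (g(hp.map(_.toDouble)), (hp, p))
    }
    keyed.groupBy(_._1).map { case (gh, rows) => gh -> rows.map(_._2) }
  }
}
```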

PERFORMANCE

FUTURE WORK

HAVE A (BIG) WEBLOG ?

▸ Weve

▸ Yandex

FUTURE WORK

LOCALITY-SENSITIVE HASHING FORESTS !

RELEASE

github.com/huitseeker/spark-lsh

1 SEPT 2015
