francois-garillot
A GENTLE INTRODUCTION TO APACHE SPARK AND LOCALITY-SENSITIVE HASHING
LOCALITY-SENSITIVE HASHING
▸ A story: Why LSH
▸ How it works & hash families
▸ LSH distribution
▸ Beware: WIP
SPARK TENETS
▸ broadcast variables
▸ per-partition commands
▸ shuffle sparsely
SEGMENTATION
▸ small sample: 289,421 users
▸ larger sample: 5,684,403 users
▸ 46K websites, ultimately users
▸ 4 personal laptops, 4 provided laptops
K-MEANS COMPLEXITY
Find $k$ with the 'elbow method' on the within-cluster sum of squares. Then exact k-means takes $O(n^{dk+1})$ time for $n$ points in $d$ dimensions.
EM - GAUSSIAN MIXTURE
With $d$ dimensions and $k$ mixture components, each EM iteration costs $O(nkd^2)$ with full covariances.
LOCALITY-SENSITIVE HASHING FUNCTIONS
A family $\mathcal{H}$ of hashing functions is $(r_1, r_2, p_1, p_2)$-sensitive if, for $h$ drawn uniformly from $\mathcal{H}$:
▸ if $d(x, y) \le r_1$ then $\Pr[h(x) = h(y)] \ge p_1$
▸ if $d(x, y) \ge r_2$ then $\Pr[h(x) = h(y)] \le p_2$
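A concrete instance (my sketch, not from the deck): for Hamming distance over $n$-bit vectors, sampling a single random bit position is such a family, since $\Pr[h(x) = h(y)] = 1 - d(x, y)/n$. The helper names below are hypothetical:

```scala
// Sketch: bit-sampling LSH for Hamming distance (illustrative, not the talk's code).
// h_i(x) = x(i) with i chosen at random gives Pr[h(x) = h(y)] = 1 - hamming(x, y)/n,
// so the family is (r1, r2, 1 - r1/n, 1 - r2/n)-sensitive for any r1 < r2.
import scala.util.Random

object BitSamplingLSH {
  def hasher(n: Int, rng: Random): Vector[Boolean] => Boolean = {
    val i = rng.nextInt(n) // the random index that defines this hash function
    (x: Vector[Boolean]) => x(i)
  }

  // Empirical collision rate over many independently drawn hash functions.
  def collisionRate(x: Vector[Boolean], y: Vector[Boolean],
                    trials: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val hits = (1 to trials).count { _ =>
      val h = hasher(x.length, rng)
      h(x) == h(y)
    }
    hits.toDouble / trials
  }
}
```

With vectors at Hamming distance 1 out of 4 bits, the measured collision rate converges to 0.75, matching the formula above.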
DISTANCES! (THOSE AND MANY OTHERS)
▸ Hamming distance: $h(x) = x_i$, where $i$ is a randomly chosen index
▸ Jaccard: $J(A, B) = |A \cap B| \,/\, |A \cup B|$, hashed with MinHash
▸ Cosine distance: $h(x) = \operatorname{sign}(\langle x, v \rangle)$ for a random vector $v$ (SimHash)
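For Jaccard, MinHash is the classic family: hash every element of a set and keep the minimum; the minima of two sets then agree with probability (approximately, for affine hash maps like the one below) $J(A, B)$. A self-contained sketch with hypothetical helper names:

```scala
// Sketch: MinHash for Jaccard similarity (illustrative, not the talk's code).
import scala.util.Random

object MinHash {
  // One MinHash function: map element ids through a random affine function
  // modulo a prime, then keep the minimum over the set.
  def minHasher(rng: Random): Set[Int] => Long = {
    val p = 2147483647L // Mersenne prime 2^31 - 1
    val a = 1L + rng.nextInt(Int.MaxValue - 1) // nonzero multiplier
    val b = rng.nextInt(Int.MaxValue).toLong
    (s: Set[Int]) => s.map(x => (a * x + b) % p).min
  }

  // Estimate J(a, b) as the fraction of k MinHash functions that agree.
  def estimateJaccard(a: Set[Int], b: Set[Int], k: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val hashers = Vector.fill(k)(minHasher(rng))
    hashers.count(h => h(a) == h(b)).toDouble / k
  }
}
```

Sets {1,2,3,4} and {1,2,3,5} have $J = 3/5$; the estimate converges there as $k$ grows.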
EARTH MOVER'S DISTANCE
Find the optimal flow $F = (f_{ij})$ minimizing:
$$\sum_{i} \sum_{j} f_{ij}\, d_{ij}$$
Then:
$$\mathrm{EMD}(P, Q) = \frac{\sum_{i} \sum_{j} f_{ij}\, d_{ij}}{\sum_{i} \sum_{j} f_{ij}}$$
A WORD ON MODULARITY
LSH for EMD was introduced by Charikar in the SimHash paper (2002).
Yet existing implementations (e.g. scikit-learn, mrsqueeze) give you no place to plug in your own LSH family!
LSH AMPLIFICATION: CONCATENATION (AND) AND PARALLEL (OR)
▸ basic LSH: collision probability $p = \Pr[h(x) = h(y)]$
▸ AND (series) construction: concatenate $k$ hashes, collision probability $p^k$
▸ OR (parallel) construction: $L$ independent tables, collision probability $1 - (1 - p)^L$
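The amplified probabilities are easy to tabulate; a minimal sketch (the banded scheme, an OR of ANDs, is what the banding code later on implements):

```scala
// Sketch of how AND/OR amplification reshapes a family's collision probability p.
object Amplify {
  // k-way AND: all k concatenated hashes must agree.
  def andProb(p: Double, k: Int): Double = math.pow(p, k)

  // L-way OR: at least one of L parallel tables agrees.
  def orProb(p: Double, l: Int): Double = 1.0 - math.pow(1.0 - p, l)

  // Banded scheme (OR of ANDs): `bands` bands of `rowsPerBand` hashes each.
  def bandedProb(p: Double, rowsPerBand: Int, bands: Int): Double =
    orProb(andProb(p, rowsPerBand), bands)
}
```

The banded curve is the useful one: it stays near 0 for dissimilar pairs and jumps sharply toward 1 past a similarity threshold set by the rows/bands trade-off.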
BASIC LSH
```scala
val hashCollection = records
  .map(s => (getId(s), s))
  .mapValues(s => getHash(s, hashers))

val subArrays = hashCollection.flatMap { case (recordId, hash) =>
  hash.grouped(hashLength / numberBands).zipWithIndex.map {
    case (band, bandIndex) => (bandIndex, (band, recordId))
  }
}
```
LOOKUP
```scala
def findCandidates(record: Iterable[String],
                   hashers: Array[Int => Int],
                   mBands: BandType) = {
  val hash = getHash(record, hashers)
  val subArrays = partitionArray(hash).zipWithIndex

  subArrays.flatMap { case (band, bandIndex) =>
    val hashedBucket = mBands.lookup(bandIndex)
      .headOption
      .flatMap(_.get(band))
    hashedBucket
  }.flatten.toSet
}
```
DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS
`getHash(record, hashers)` needs its hash functions on every executor; rebuild them from a seed inside each partition instead of serializing the functions themselves:
```scala
records.mapPartitions { iter =>
  val rng = new scala.util.Random(seed)
  iter.map(x => hashers.flatMap(h => getHashFunction(rng, h)(x)))
}
```
AND YET, OOM
BASIC LSH WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With $n$ data points, choose $k = O(\log n)$ hashes per table and $L = O(n^{\rho})$ tables ($\rho = 1/c$) to solve the $c$-approximate near-neighbor problem.
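The usual 2-stable hash (Datar et al.) is $h(v) = \lfloor (a \cdot v + b)/w \rfloor$ with Gaussian $a$ and $b \sim U[0, w)$; a minimal sketch (helper names are mine):

```scala
// Sketch: one 2-stable (Gaussian) LSH function for L2 distance
// (illustrative, not the talk's implementation).
import scala.util.Random

object PStableLSH {
  // h(v) = floor((a . v + b) / w), a ~ N(0,1)^d, b ~ U[0, w).
  // Points close in L2 land in the same integer slot with high probability.
  def hasher(d: Int, w: Double, rng: Random): Array[Double] => Int = {
    val a = Array.fill(d)(rng.nextGaussian())
    val b = rng.nextDouble() * w
    (v: Array[Double]) => {
      val dot = a.zip(v).map { case (x, y) => x * y }.sum
      math.floor((dot + b) / w).toInt
    }
  }
}
```

Drawing many such hashes, a point collides with a small perturbation of itself far more often than with a distant point, which is exactly what the AND/OR amplification then sharpens.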
WEB LOGS ARE SPARSE
Input: hits per user over 6 months, 2×50-ish integers per user (4 GB)
Output: 1000 integers per user — 10 (parallel) bands of 100 (concatenated) hashes
As 64-bit integers: 40 GB
Yet the input was only 4 GB!
ENTROPY LSH (PANIGRAHY 2006): REPLACE TABLES BY OFFSETS
Query the perturbed points $q + \delta_i$, $i = 1, \dots, L$, with each offset $\delta_i$ chosen uniformly at random from the surface of $B(q, r)$, the sphere of radius $r$ centered at $q$.
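Sampling uniformly from the sphere's surface can be sketched with the standard trick of normalizing a Gaussian vector (names are mine, not from the deck):

```scala
// Sketch: draw q + delta with delta uniform on the surface of B(q, r).
import scala.util.Random

object SphereOffset {
  // A standard Gaussian vector has rotation-invariant direction, so
  // normalizing it and scaling by r yields a uniform point on the
  // radius-r sphere; adding q centres it at the query.
  def offset(q: Array[Double], r: Double, rng: Random): Array[Double] = {
    val g = Array.fill(q.length)(rng.nextGaussian())
    val norm = math.sqrt(g.map(x => x * x).sum)
    q.zip(g).map { case (qi, gi) => qi + r * gi / norm }
  }
}
```

Every sampled point lies at distance exactly $r$ from $q$, which is what lets perturbed queries probe the neighboring buckets of a single hash table.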
ENTROPY LSH WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With $n$ data points, choose $k = O(\log n)$ and $O(n^{2/c})$ random offsets, to solve the problem with as few as $O(1)$ hash tables.
BUT ... NETWORK COSTS
▸ Basic LSH: look up $L$ buckets, each potentially on a different machine
▸ Entropy LSH: search for $L$ offsets, again scattered across the cluster
LAYERED LSH (BAHMANI ET AL. 2012)
The output of your LSH family lives in $\mathbb{R}^k$, with e.g. a cosine norm.
For close points, their hashes are themselves close, so the chance of the hashes landing in the same bucket is high!
LAYERED LSH
Take a second LSH family $G$ for your norm on the hash space.
Then it is likely that $G(H(q + \delta_i)) = G(H(q))$ for all offsets $\delta_i$.
LAYERED LSH
The output of hash generation is $(G(H(p)), (H(p), p))$ for every point $p$.
In Spark: group by key, or use a custom partitioner, on the $(H(p), p)$ RDD.
Network cost: each point, and each query with all its offsets, is shipped to a single $G$-bucket.
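Off-cluster, the grouping step can be sketched with plain collections ($G$ and $H$ are stand-ins for the outer and inner hash; all names are mine):

```scala
// Sketch: layered LSH grouping. The inner hash H buckets points; the
// outer hash G decides which "machine" (partition) a bucket lives on,
// so nearby buckets co-locate (illustrative, not the talk's code).
object LayeredLSHSketch {
  def layer[P, HK, GK](points: Seq[P])(h: P => HK)(g: HK => GK)
      : Map[GK, Map[HK, Seq[P]]] =
    points
      .map(p => (g(h(p)), (h(p), p)))  // emit (GH(p), (H(p), p))
      .groupBy(_._1)                   // one group per G-bucket ("machine")
      .map { case (gk, grouped) =>
        gk -> grouped.map(_._2)
          .groupBy(_._1)               // inside a machine, group by H-bucket
          .map { case (hk, ps) => hk -> ps.map(_._2) }
      }
}
```

In Spark the outer `groupBy` would be the shuffle (or a custom partitioner on $G(H(p))$), and the inner grouping happens locally within each partition.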
PERFORMANCE
FUTURE WORK: HAVE A (BIG) WEBLOG?
▸ Weve
▸ Yandex
FUTURE WORK: LOCALITY-SENSITIVE HASHING FORESTS!
RELEASE: github.com/huitseeker/spark-lsh
1 SEPT 2015