Advanced Topics in Artificial Intelligence: Similarity Search in High Dimensions via Hashing. Aristides Gionis, Piotr Indyk, Rajeev Motwani. Presenter: Maruf Aytekin, PhD Student, Computer Engineering Department, Bahcesehir University. Apr 21, 2015


Page 1: Similarity Search in High Dimensions via Hashing

Advanced Topics in Artificial Intelligence

Similarity Search in High Dimensions via Hashing

Aristides Gionis, Piotr Indyk, Rajeev Motwani

Presenter

Maruf Aytekin, PhD Student

Computer Engineering Department, Bahcesehir University

Apr 21, 2015

Page 2: Similarity Search in High Dimensions via Hashing

Outline

• LSH
• Locality-Sensitive Functions
• Banding Technique
• LSH Families for Cosine
• Applications of LSH
• Conclusion

Page 3: Similarity Search in High Dimensions via Hashing

LSH

One general approach to LSH:

• “Hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.

• We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair.

• We check only the candidate pairs for similarity.
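The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `candidate_pairs`, the toy string items, and the two "hash functions" (first and last character) are hypothetical and are not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(items, hash_functions):
    """Return every pair of item ids that shares a bucket under at least one hashing."""
    candidates = set()
    for h in hash_functions:
        # One fresh set of buckets per hashing.
        buckets = defaultdict(list)
        for item_id, item in items.items():
            buckets[h(item)].append(item_id)
        # Any two items in the same bucket become a candidate pair.
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


# Toy data (hypothetical): strings hashed by first and by last character.
items = {"a": "hashing", "b": "hashes", "c": "search"}
pairs = candidate_pairs(items, [lambda s: s[0], lambda s: s[-1]])
print(pairs)  # {('a', 'b')}
```

Only the pairs returned here would then be passed to the exact (and more expensive) similarity check.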

Page 4: Similarity Search in High Dimensions via Hashing

LSH

• Most of the dissimilar pairs will never hash to the same bucket, and therefore will never be checked.

• Those dissimilar pairs that do hash to the same bucket are false positives: a small fraction of all pairs.

• We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the hash functions.

• Those that do not are false negatives: only a small fraction of the truly similar pairs.

Page 5: Similarity Search in High Dimensions via Hashing

Locality-Sensitive Functions

In many cases, the function f will “hash” items, and the decision will be based on whether or not the results are equal.

• f(x) = f(y) is shorthand for “f(x, y) is yes: make x and y a candidate pair.”

• f(x) ≠ f(y) is shorthand for “do not make x and y a candidate pair.”

A collection of functions of this form will be called a family of functions.

Page 6: Similarity Search in High Dimensions via Hashing

Locality-Sensitive Functions

Let d1 < d2 be two distances according to some distance measure d. A family F of functions is said to be (d1, d2, p1, p2)-sensitive if for every f in F:

1. If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least p1.

2. If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at most p2.

Page 7: Similarity Search in High Dimensions via Hashing

Locality-Sensitive Functions

Behavior of a (d1, d2, p1, p2)-sensitive function

• d1 and d2 can be made as close as we wish.

• The penalty is that p1 and p2 become close as well.

Page 8: Similarity Search in High Dimensions via Hashing

Banding Technique

An effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each.

Dividing a signature matrix into four bands of three rows per band
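A minimal sketch of the banding step, assuming each signature is stored as a plain list of length b·r and that the tuple of a band's r values is used directly as the bucket key. The function name `banding_candidates` and the toy signatures are hypothetical; the choice b = 4, r = 3 mirrors the caption above.

```python
from collections import defaultdict
from itertools import combinations


def banding_candidates(signatures, b, r):
    """signatures: dict id -> list of length b * r. Returns the set of candidate pairs."""
    candidates = set()
    for band in range(b):
        # Each band gets its own buckets, so agreement must occur within the same band.
        buckets = defaultdict(list)
        for item_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(item_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


# Toy signatures (hypothetical): A and B agree on all three rows of the first band only.
signatures = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3],
    "B": [1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "C": [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
}
pairs = banding_candidates(signatures, b=4, r=3)
print(pairs)  # {('A', 'B')}
```

Two items become a candidate pair as soon as they agree on every row of a single band, even if they disagree everywhere else.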

Page 9: Similarity Search in High Dimensions via Hashing

Analysis of the Banding Technique

The probability that two signatures with similarity s agree in every row of at least one band, and so become a candidate pair, is $1 - (1 - s^r)^b$.

This function has the form of an S-curve:

The threshold, that is, the value of similarity s at which the probability of becoming a candidate pair is 1/2, is a function of b and r (here b = 16, r = 4).

Page 10: Similarity Search in High Dimensions via Hashing

Analysis of the Banding Technique

Values of the S-curve for b = 20 and r = 5
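The values referred to above can be recomputed from the candidate-pair probability 1 − (1 − s^r)^b. The sketch below assumes a hypothetical helper name, `candidate_probability`, and an arbitrary grid of similarities from 0.2 to 0.8.

```python
def candidate_probability(s, b, r):
    """Probability that two items with similarity s agree on all r rows of at least one of b bands."""
    return 1.0 - (1.0 - s ** r) ** b


# Tabulate the S-curve for b = 20 bands of r = 5 rows.
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"s = {s:.1f}   P(candidate) = {candidate_probability(s, b=20, r=5):.4f}")
```

The steep rise between roughly s = 0.4 and s = 0.6 is what gives the curve its S shape.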

Page 11: Similarity Search in High Dimensions via Hashing

Analysis of the Banding Technique

• Choose a threshold t that defines how similar items have to be in order for them to be a “candidate pair.”

• Pick b and r such that br = n, where n is the length of the signatures, and the threshold t is approximately $(1/b)^{1/r}$.

• If avoiding false negatives is important, select b and r to produce a threshold lower than t.

• If speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
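One simple way to apply the rule above is to try every factorization b·r = n and keep the pair whose approximate threshold (1/b)^(1/r) lies closest to t. This is a sketch under that assumption; the function name `choose_bands` is hypothetical, and a real system might deliberately bias the choice toward a lower or higher threshold, as the last two bullets suggest.

```python
def choose_bands(n, t):
    """Among all (b, r) with b * r == n, return (b, r, threshold) with threshold closest to t."""
    best = None
    for r in range(1, n + 1):
        if n % r:
            continue  # r must divide the signature length n
        b = n // r
        threshold = (1.0 / b) ** (1.0 / r)
        if best is None or abs(threshold - t) < abs(best[2] - t):
            best = (b, r, threshold)
    return best


b, r, threshold = choose_bands(n=64, t=0.5)
print(b, r, round(threshold, 3))  # 16 4 0.5
```

For signatures of length 64 and a target threshold of 0.5, this selects b = 16 and r = 4, the setting used for the S-curve earlier.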

Page 12: Similarity Search in High Dimensions via Hashing

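LSH for Cosine

Let u be user u's rating vector, v be user v's rating vector, and r be a randomly generated vector. The family of hash functions H consists of the functions

$h_r(u) = 1$ if $r \cdot u \ge 0$, and $h_r(u) = 0$ if $r \cdot u < 0$,

where

$\Pr[h_r(u) = h_r(v)] = 1 - \theta(u, v)/\pi$, with $\theta(u, v)$ the angle between u and v,

which shows the probability of u and v being declared a candidate pair.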
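The collision probability of this family can be checked empirically. The sketch below assumes the standard random-hyperplane scheme: h_r(u) is 1 when r·u ≥ 0 and 0 otherwise, and two vectors at angle θ collide with probability 1 − θ/π. It draws r with independent Gaussian components, which is an assumption of this sketch and not something the slides specify; all function names are hypothetical.

```python
import math
import random


def hyperplane_hash(r, u):
    """Return 1 if u lies on the non-negative side of the hyperplane with normal r, else 0."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0


def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def collision_rate(u, v, trials=20000, seed=0):
    """Fraction of random hyperplanes under which u and v receive the same hash bit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in u]
        hits += hyperplane_hash(r, u) == hyperplane_hash(r, v)
    return hits / trials


u = [5, 4, 0, 4, 1]
v = [2, 1, 1, 1, 4]
expected = 1.0 - angle(u, v) / math.pi
observed = collision_rate(u, v)
print(f"expected {expected:.3f}, observed {observed:.3f}")
```

With enough trials the observed rate should settle close to the expected value; the smaller the angle between the two rating vectors, the more often they share a hash bit.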

Page 13: Similarity Search in High Dimensions via Hashing

LSH for Cosine

A new family G of hash functions g is defined, where each function g is obtained by concatenating (AND-ing) r functions h1, h2, ..., hr from the family H:

g(t) = [h1(t), ..., hr(t)].

We then draw a random function g for each band (hash table) and construct b hash tables.
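The construction above can be sketched as follows, assuming random vectors with ±1 components and the sign rule "1 if the dot product is ≥ 0, else 0". The names `make_g`, `build_tables`, and `query`, as well as the choice b = 3, are hypothetical and serve only as an illustration.

```python
import random
from collections import defaultdict


def make_g(dim, r, rng):
    """Draw r random +/-1 vectors and return g(t) = (h1(t), ..., hr(t))."""
    planes = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(r)]

    def g(t):
        return tuple(
            1 if sum(p_i * t_i for p_i, t_i in zip(plane, t)) >= 0 else 0
            for plane in planes
        )

    return g


def build_tables(vectors, b, r, seed=0):
    """Build b hash tables, each keyed by its own randomly drawn g."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    tables = []
    for _ in range(b):
        g = make_g(dim, r, rng)
        table = defaultdict(list)
        for name, vec in vectors.items():
            table[g(vec)].append(name)
        tables.append((g, table))
    return tables


def query(tables, t):
    """Union of the buckets that t falls into across all tables (OR over the bands)."""
    found = set()
    for g, table in tables:
        found.update(table.get(g(t), []))
    return found


users = {"u1": [5, 4, 0, 4, 1], "u2": [2, 1, 1, 1, 4],
         "u3": [4, 3, 0, 5, 2], "u4": [2, 1, 2, 1, 4]}
tables = build_tables(users, b=3, r=4)
print(query(tables, users["u1"]))
```

Concatenating r bits inside one table makes a collision harder (AND), while taking the union over b tables makes it easier (OR), which is the same trade-off the banding analysis describes.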

Page 14: Similarity Search in High Dimensions via Hashing

LSH for Cosine

Example:

u1 = [5, 4, 0, 4, 1]
u2 = [2, 1, 1, 1, 4]
u3 = [4, 3, 0, 5, 2]
u4 = [2, 1, 2, 1, 4]

r1 = [-1, 1, 1, -1, -1]
r2 = [1, 1, 1, -1, -1]
r3 = [-1, -1, 1, -1, 1]
r4 = [-1, 1, -1, 1, -1]

u1.r1 = -6 => hr1(u1) = 0
u1.r2 = 4 => hr2(u1) = 1
u1.r3 = -12 => hr3(u1) = 0
u1.r4 = 2 => hr4(u1) = 1

g(u1) = 0101
g(u2) = 0010
g(u3) = 0101
g(u4) = 0110
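The example can be reproduced with a short script. It assumes that a dot product of exactly zero hashes to 1; that tie-breaking rule is inferred from the listed values (u3·r2 = 0 and u4·r2 = 0 both yield a 1 bit) and is not stated explicitly on the slide. The helper name `h` and the dictionary layout are hypothetical.

```python
def h(r, u):
    """Sign hash: 1 if the dot product r.u is >= 0, else 0."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0


random_vectors = [
    [-1, 1, 1, -1, -1],   # r1
    [1, 1, 1, -1, -1],    # r2
    [-1, -1, 1, -1, 1],   # r3
    [-1, 1, -1, 1, -1],   # r4
]
users = {
    "u1": [5, 4, 0, 4, 1],
    "u2": [2, 1, 1, 1, 4],
    "u3": [4, 3, 0, 5, 2],
    "u4": [2, 1, 2, 1, 4],
}

# g(u) is the concatenation of the four hash bits.
signatures = {
    name: "".join(str(h(r, u)) for r in random_vectors)
    for name, u in users.items()
}
print(signatures)
# {'u1': '0101', 'u2': '0010', 'u3': '0101', 'u4': '0110'}
```

Since g(u1) and g(u3) are identical, u1 and u3 land in the same bucket of this hash table and become a candidate pair.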

Page 15: Similarity Search in High Dimensions via Hashing

Applications of LSH

• Near neighbor search
• Entity Resolution
• Matching Fingerprints
• Matching Newspaper Articles

Page 16: Similarity Search in High Dimensions via Hashing

Thank You

Q & A