Feb 22, 2008 1 Locality-Sensitive Hashing CS 395T: Visual Recognition and Search Marc Alban


Feb 22, 2008 1

Locality-Sensitive Hashing

CS 395T: Visual Recognition and Search

Marc Alban


Nearest Neighbor

Given any query point $q$, return the point closest to $q$.

Useful for finding similar objects in a database.

Brute-force linear search is not practical for massive databases.
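As a baseline for the rest of the talk, the brute-force search mentioned above can be sketched in a few lines. This is a minimal illustration only: the slide does not name a metric, so Euclidean distance is an assumption here, and the function name `nearest_neighbor` and the toy points are hypothetical.

```python
import math

def nearest_neighbor(points, q):
    """Return (index, distance) of the point closest to q.

    Assumes Euclidean distance; every point is examined once, so the
    cost is linear in the size of the database.
    """
    best_i, best_d = -1, math.inf
    for i, p in enumerate(points):
        d = math.dist(p, q)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# Toy database of 2-D points (illustrative values only).
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(points, (0.9, 1.2)))
```

Every query touches all N points, which is the cost LSH tries to avoid.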


The “Curse of Dimensionality”

For $d < 10$ to $20$, data structures exist that require sublinear time and near-linear space to perform a NN search.

Time or space requirements grow exponentially in the dimension.

The dimensionality of images or documents is usually on the order of several hundred or more. In that regime, brute-force linear search is the best we can do for exact search.


$(r, \epsilon)$-Nearest Neighbor

An approximate nearest neighbor should suffice in most cases.

Definition: If for any query point $q$ there exists a point $p$ such that $\|q - p\| \le r$, then with high probability return a point $p'$ such that $\|q - p'\| \le (1 + \epsilon)\, r$.
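The definition above can be read as a condition that is easy to test for a single query. The sketch below is only an illustration of that condition: the function name `satisfies_r_eps` is hypothetical, and Euclidean distance is an assumption.

```python
import math

def satisfies_r_eps(points, q, returned, r, eps):
    """Check the (r, eps)-nearest-neighbor condition for one query.

    If some database point lies within distance r of q, the returned
    point must lie within (1 + eps) * r of q. If no point lies within r,
    the definition places no requirement on the answer.
    """
    has_r_neighbor = any(math.dist(p, q) <= r for p in points)
    if not has_r_neighbor:
        return True
    return returned is not None and math.dist(returned, q) <= (1 + eps) * r

# Illustrative values: (0, 0) is within r = 0.6 of the query.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
q = (0.0, 0.5)
print(satisfies_r_eps(points, q, (1.0, 0.0), r=0.6, eps=1.0))
print(satisfies_r_eps(points, q, (1.0, 0.0), r=0.6, eps=0.5))
```

Note that the returned point does not have to be the true nearest neighbor; it only has to fall inside the relaxed radius.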


Locality-Sensitive Hash Families

Definition: An LSH family $H(c, r, P_1, P_2)$ has the following properties for any $q, p \in S$:

1. If $\|p - q\| \le r$ then $\Pr_H[h(p) = h(q)] \ge P_1$

2. If $\|p - q\| \ge cr$ then $\Pr_H[h(p) = h(q)] \le P_2$


Hamming Space

Definition: Hamming space is the set of all $2^N$ binary strings of length $N$.

Definition: The Hamming distance between two equal-length binary strings is the number of positions at which the bits differ.

$\|1110101, 1111101\|_H = 1$

$\|1011101, 1001001\|_H = 2$
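The two distances above can be checked with a short function. This is a minimal sketch that represents binary strings as Python `str` values of `'0'` and `'1'`; the function name `hamming` is a choice made here, not something from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("1110101", "1111101"))  # 1
print(hamming("1011101", "1001001"))  # 2
```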


Hamming Space

Let a hashing family be defined as $h_i(p) = p_i$, where $p_i$ is the $i$th bit of $p$. Clearly, this family is locality sensitive. With $d$ the length of the bit strings:

$\Pr_H[h(p) = h(q)] = 1 - \frac{\|p, q\|_H}{d}$

$\Pr_H[h(p) \ne h(q)] = \frac{\|p, q\|_H}{d}$
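The collision probability of this bit-sampling family can be verified exactly by enumerating every position. The sketch below assumes the index i is drawn uniformly from the d positions; the helper names are hypothetical.

```python
from fractions import Fraction

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def collision_probability(p, q):
    """Exact Pr[h_i(p) == h_i(q)] for i drawn uniformly from the d positions."""
    d = len(p)
    matches = sum(1 for i in range(d) if p[i] == q[i])
    return Fraction(matches, d)

p, q = "1011101", "1001001"
# The enumerated probability agrees with 1 - ||p, q||_H / d.
print(collision_probability(p, q), 1 - Fraction(hamming(p, q), len(p)))
```

Points that are close in Hamming distance agree on most positions, so a randomly sampled bit is likely to match.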


k-bit LSH Functions

A k-bit locality-sensitive hash function (LSHF) is defined as:

$g(p) = [h_1(p), h_2(p), \ldots, h_k(p)]^T$

Each $h_i$ is chosen randomly from $H$. Each $h_i$ results in a single bit.

Pr(similar points collide) $\ge P_1^k$ (all $k$ independently chosen bits must agree)

Pr(dissimilar points collide) $\le P_2^k$
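For the Hamming family, a k-bit function amounts to sampling k bit positions and concatenating the selected bits. The following is a sketch under that assumption; `make_g` is a hypothetical helper, and positions are drawn independently (with replacement) to match "each h_i is chosen randomly".

```python
import random

def make_g(d, k, rng):
    """Build one k-bit hash function g over bit strings of length d.

    Draws k positions uniformly at random (with replacement) and returns
    a function mapping p to the tuple (p[i_1], ..., p[i_k]), which can be
    used directly as a dictionary key.
    """
    positions = [rng.randrange(d) for _ in range(k)]

    def g(p):
        return tuple(p[i] for i in positions)

    return g

rng = random.Random(0)  # fixed seed so the sketch is repeatable
g = make_g(d=7, k=3, rng=rng)
print(g("1110101"), g("1111101"))
```

Two points share a key only if they agree on every sampled position, which is why larger k makes collisions rarer for both similar and dissimilar points.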


LSH Preprocessing

Each training example is entered into $l$ hash tables indexed by independently constructed $g_1, \ldots, g_l$.

Preprocessing space: $O(lN)$

[Figure: hash tables 1, 2, ..., l]


LSH Querying

For each hash table $i$, $1 \le i \le l$:
  Return the bin indexed by $g_i(q)$.

Perform a linear search on the union of the bins.
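The preprocessing and querying steps can be sketched together as a small index over bit strings. This is an illustrative sketch, not the toolbox used in the talk: the class name `HammingLSH` and its methods are hypothetical, the hash family is assumed to be bit sampling, and the search-length cap B is omitted (equivalent to an unlimited B).

```python
import random
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

class HammingLSH:
    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        # l independently constructed k-bit functions, each a list of positions.
        self.projections = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]
        self.points = []

    @staticmethod
    def _key(p, positions):
        return tuple(p[i] for i in positions)

    def add(self, p):
        """Preprocessing: store the point's index in one bin of every table."""
        idx = len(self.points)
        self.points.append(p)
        for positions, table in zip(self.projections, self.tables):
            table[self._key(p, positions)].append(idx)

    def query(self, q):
        """Return (best point, number of candidates examined)."""
        candidates = set()
        for positions, table in zip(self.projections, self.tables):
            # .get avoids creating empty bins in the defaultdict.
            candidates.update(table.get(self._key(q, positions), ()))
        if not candidates:
            return None, 0  # no bin matched: the search fails
        best = min(candidates, key=lambda i: hamming(self.points[i], q))
        return self.points[best], len(candidates)

index = HammingLSH(d=8, k=3, l=4)
for p in ["00000000", "00001111", "11110000", "11111111"]:
    index.add(p)
print(index.query("00000001"))
```

Each point is stored once per table, which gives the O(lN) space noted earlier. A query can return `None` when none of its l bins contains a point.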


Parameter Selection

Suppose we want to search at most $B$ examples. Then setting

$k = \log_{1/P_2}\left(\frac{N}{B}\right), \quad l = \left(\frac{N}{B}\right)^{\frac{\log(1/P_1)}{\log(1/P_2)}}$

ensures that it will succeed with high probability.
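The two formulas can be evaluated directly. In the sketch below, rounding both values up to integers is an assumption made here, and the values of P1 and P2 in the example are hypothetical, chosen only to exercise the function; N matches the dataset size used later in the talk.

```python
import math

def lsh_parameters(N, B, P1, P2):
    """Return (k, l) from k = log_{1/P2}(N/B) and l = (N/B)^(log(1/P1)/log(1/P2)).

    Both values are rounded up to the next integer (an assumption of this sketch).
    """
    ratio = N / B
    k = math.log(ratio) / math.log(1 / P2)
    rho = math.log(1 / P1) / math.log(1 / P2)
    l = ratio ** rho
    return math.ceil(k), math.ceil(l)

# Hypothetical collision probabilities P1 = 0.9 and P2 = 0.7.
print(lsh_parameters(N=59500, B=250, P1=0.9, P2=0.7))
```

The choice of k gives $P_2^k = B/N$, so the expected number of dissimilar points colliding with the query in one table is at most B.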


Experiment 1

Compare LSH accuracy and performance to exact NN search.

Examine the influence of:
  k, the number of hash bits.
  l, the number of hash tables.
  B, the maximum search length.

Dataset:
  59,500 20x20 patches taken from motorcycle images.
  Represented as 400-dimensional column vectors.


Hash Function

Convert the feature vectors into binary strings and use the Hamming hash functions.

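Given a vector $x \in \mathbb{N}^d$ we can create a unary representation for each element $x_i$.

$\mathrm{Unary}_C(x_i)$ = $x_i$ 1's followed by $(C - x_i)$ 0's, where $C$ is the max coordinate for all points.

$u(x) = \mathrm{Unary}_C(x_1), \ldots, \mathrm{Unary}_C(x_d)$

Note that for any two points $p, q$, the $L_1$ distance between the vectors equals the Hamming distance between their encodings:

$\|p, q\| = \|u(p), u(q)\|_H$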
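The unary embedding and the distance identity can be checked on a small example. This sketch uses hypothetical helper names (`unary`, `u`, `l1`) and toy vectors; it builds the full bit strings for clarity.

```python
def unary(x_i, C):
    """x_i ones followed by (C - x_i) zeros."""
    return "1" * x_i + "0" * (C - x_i)

def u(x, C):
    """Concatenate the unary codes of all coordinates of x."""
    return "".join(unary(x_i, C) for x_i in x)

def l1(p, q):
    """L1 (Manhattan) distance between two integer vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

p, q = [3, 0, 5], [1, 2, 5]
C = 5  # max coordinate over all points in this toy example
print(u(p, C), u(q, C))
print(l1(p, q), hamming(u(p, C), u(q, C)))  # both are 4
```

The encoded string never needs to be stored in practice: bit j (counting from 0) of the code for $x_i$ is 1 exactly when $x_i > j$, so a sampled bit can be computed from the original vector with one comparison.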


Example Query

$l = 20$, $k = 24$, $B = \infty$

Examples searched: 7,722 of 59,500

[Figure: query patch, LSH result, and actual nearest neighbors]


Average Search Length

Let $B = \infty$

[Figure: average search length (color scale 2 to 24, x1000) over $l$ and $k$, each from 5 to 30]


Average Search Length

Let $B = \infty$

[Figure: average search length (color scale 2 to 24, x1000) over $l$ and $k$, each from 5 to 30]

More hash bits ($k$) result in shorter searches.

More hash tables ($l$) result in longer searches.


Average Approximation Error

Let $B = \infty$

[Figure: average approximation error (color scale 1.04 to 1.11) over $l$ and $k$, each from 5 to 30]


Average Approximation Error

Let $B = \infty$

[Figure: average approximation error (color scale 1.04 to 1.11) over $l$ and $k$, each from 5 to 30]

Over-hashing can result in too few candidates to return a good approximation.

Over-hashing can cause the algorithm to fail.


Average Approximation Error

Let $B = \infty$

[Figure: average approximation error (color scale 1.04 to 1.11) over $l$ and $k$, each from 5 to 30, with the annotation "Average search length = 8000"]

Over-hashing can result in too few candidates to return a good approximation.

Over-hashing can cause the algorithm to fail.


Average Approximation Error

Let $B = 5500 \approx \frac{N}{\ln N}$

[Figure: average approximation error (color scale 1.08 to 1.15) over $l$ and $k$, each from 5 to 30]


Average Approximation Error

Let $B = 250 \approx \sqrt{N}$

[Figure: average approximation error (color scale 1.25 to 1.6) over $l$ and $k$, each from 5 to 30]


Experiment 2

Examine the effect of the approximation on the subjective quality of the results.

Dataset:
  D. Nistér and H. Stewénius, "Scalable recognition with a vocabulary tree".
  2550 sets of 4 images, represented as a document-term matrix of the visual words.


Experiment 2: Issues

LSH requires a vector representation.

Not clear how to easily convert a bag-of-words representation into a vector one.
  A binary vector where the presence of each word is a bit does not provide a good distance measure.
  Each image differs from any other image in roughly the same number of words.

Boostmap?


Conclusions

Approximate nearest neighbor search is necessary for very large, high-dimensional datasets.

LSH is a simple approach to aNN.

LSH requires a vector representation.

Clear relationship between search length and approximation error.


Tools

Octave (MATLAB)

LSH Matlab Toolbox - http://www.cs.brown.edu/~gregory/code/lsh/

Python

Gnuplot


References

'Fast Pose Estimation with Parameter-Sensitive Hashing' – Shakhnarovich et al.

'Similarity Search in High Dimensions via Hashing' – Gionis et al.

'Object Recognition Using Locality-Sensitive Hashing of Shape Contexts' - Andrea Frome and Jitendra Malik

'Nearest neighbors in high-dimensional spaces', Handbook of Discrete and Computational Geometry – Piotr Indyk

Algorithms for Nearest Neighbor Search - http://simsearch.yury.name/tutorial.html

LSH Matlab Toolbox - http://www.cs.brown.edu/~gregory/code/lsh/