Transcript
Page 1: Nearest Neighbor Paul Hsiung March 16, 2004. Quick Review of NN Set of points P Query point q Distance metric d Find p in P such that d(p,q) < d(p’,q)

Nearest Neighbor

Paul Hsiung

March 16, 2004

Page 2:

Quick Review of NN

Set of points P
Query point q
Distance metric d
Find p in P such that d(p,q) ≤ d(p’,q) for all p’ in P

[Figure: query point q and its nearest neighbor p]
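The definition above can be sketched as a brute-force linear scan. This is only an illustration: the function name `nearest_neighbor` and the sample points are made up here, and Euclidean distance (`math.dist`) is assumed as the metric d.

```python
from math import dist  # Euclidean distance, assumed here as the metric d


def nearest_neighbor(points, q, d=dist):
    """Return the p in `points` with d(p, q) <= d(p', q) for all p' in `points`."""
    # Linear scan: one distance evaluation per stored point.
    return min(points, key=lambda p: d(p, q))


# Hypothetical sample data
P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (0.9, 1.2)
print(nearest_neighbor(P, q))
```

Every query touches all of P, which is the baseline the rest of the talk tries to beat.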

Page 3:

NN Used In…

Image databases [Pentland et al.]
Color indexing [Swain et al.]
Recognizing 3D objects [Murase et al.]
Shapes [Mori et al.]
Drug testing
DNA sequence matching [Buhler]

Page 4:

Tree-based Approaches

Quadtrees
– Split at the middle in all dimensions
– Split until no points or one point is left

Kd-trees
– Split in one dimension
– Pick the middle wisely

Ball-trees
– Pick two pivots and split

SR-trees
– We have rectangles and spheres, so why not combine them?
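As a rough illustration of the kd-tree bullets, here is a minimal build routine. The slide does not say how the middle is picked or how the split dimension is chosen; this sketch assumes a median split and cycles through the dimensions by depth. The name `build_kdtree` and the dictionary node layout are made up for the example.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree as nested dicts (assumed layout, for illustration only)."""
    if not points:
        return None
    axis = depth % len(points[0])                 # split in one dimension per level
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                           # assumption: "the middle" is the median
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }


# Hypothetical sample data
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["point"], tree["left"]["point"], tree["right"]["point"])
```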

Page 5:

Indyk’s Gripe

Beyond 10 or 20 dimensions, tree-based structures end up looking at many of the points
No better than brute-force linear search…
So he came up with a hash table approach: Locality Sensitive Hashing (LSH)
The rest of the talk is on his paper

Page 6:

LSH

Page 7:

Interlude: Near Neighbor

Set of points P
Query point q
Distance metric d
Find p in P such that d(p,q) ≤ (1+ε)d(P,q), where d(P,q) is the distance from q to its closest point in P

[Figure: q and p, with the distances d(P,q) and (1+ε)d(P,q) marked]
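The relaxed condition can be checked directly. This is a small sketch, assuming Euclidean distance and computing d(P,q) by brute force; `is_near_neighbor` and the sample points are hypothetical names and data.

```python
from math import dist  # Euclidean distance, assumed here as the metric d


def is_near_neighbor(p, points, q, eps, d=dist):
    """True if d(p, q) <= (1 + eps) * d(P, q)."""
    closest = min(d(x, q) for x in points)   # d(P, q), found by linear scan
    return d(p, q) <= (1 + eps) * closest


# Hypothetical sample data: the true nearest neighbor of q is (1, 1)
P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (2.0, 2.0)
print(is_near_neighbor((3.0, 4.0), P, q, eps=0.6))
print(is_near_neighbor((0.0, 0.0), P, q, eps=0.6))
```

With a large enough ε, a point other than the true nearest neighbor is an acceptable answer, which is what gives an approximate method room to be faster.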

Page 8:

Hash

Pick a random subset I of the coordinates
The hash function h(p) returns a bucket ID
h(p) = projection of p onto I

Page 9:

Intuition

If two points are close, they hash to the same bucket with some probability p1
If they are far apart, they hash to the same bucket with a smaller probability p2 < p1

Page 10:

Indyk’s Hash

Convert the coordinates of p to {0,1}^d
Use Hamming distance: d(p,q) = number of positions on which p and q differ

Example:
– p = (0,1,0,1,1,1,0,0,1,0)
– I = {2,5,7}
– Then h(p) = (1,1,0)

Demo:
– http://web.mit.edu/ardonite/6.838/locality-hashing.htm
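The example on this slide can be reproduced in a few lines. The function names `project` and `hamming` are made up for this sketch, and the second point q is an invented neighbor of p; the indices in I are treated as 1-based, which is what the slide's numbers imply.

```python
def hamming(p, q):
    """Number of positions on which the bit vectors p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def project(p, I):
    """h(p): the coordinates of p at the 1-based positions in I, in increasing order."""
    return tuple(p[i - 1] for i in sorted(I))


p = (0, 1, 0, 1, 1, 1, 0, 0, 1, 0)
I = {2, 5, 7}
print(project(p, I))                  # the slide's h(p)

# Hypothetical second point that differs from p only at position 5
q = (0, 1, 0, 1, 0, 1, 0, 0, 1, 0)
print(hamming(p, q), project(q, I))
```

Here q is at Hamming distance 1 from p but lands in a different bucket, because the one differing position happens to be in I.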

Page 11:

Why Locality-sensitive?

Pr[h(p)=h(q)] = (1 - d(p,q)/D)^k
– D is the number of dimensions in the binary representation
– k is the size of I

We can vary the probability by changing k

[Figure: two plots of Pr against distance, one for k=1 and one for k=2]
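To see how k shapes the curve, the formula can be evaluated directly. This sketch assumes the k coordinates in I are drawn independently and uniformly, so that each one agrees on p and q with probability 1 - d(p,q)/D; the function name and the choice D = 10 are illustrative only.

```python
def collision_probability(distance, D, k):
    """Pr[h(p) = h(q)] = (1 - d(p,q)/D)^k for Hamming distance `distance`."""
    return (1 - distance / D) ** k


D = 10  # hypothetical number of binary dimensions
for k in (1, 2, 4):
    row = [round(collision_probability(dist_pq, D, k), 3) for dist_pq in range(D + 1)]
    print(k, row)
```

A larger k makes the probability fall off faster with distance: far points collide less often, but near points are also missed more often.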

Page 12:

Now to Use It (Training)

Generate l hash functions: h_1, ..., h_l
Store each point p in bucket h_i(p) of the i-th hash array, for i = 1, ..., l

Page 13:

Now to Use It (Query)

Retrieve all the points in the buckets h_1(q), ..., h_l(q)
Return the retrieved point that is closest to q
This “solves” the Near Neighbor problem
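The training and query steps can be put together in a compact sketch. This is not the paper's implementation: the class name `LSHIndex`, the use of Python dictionaries as hash arrays, and drawing each subset with `random.sample` (without replacement) are all assumptions made for illustration. Points are assumed to be tuples of bits.

```python
import random
from collections import defaultdict


def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


class LSHIndex:
    """l hash tables, each keyed by the projection of a point onto k random coordinates."""

    def __init__(self, dim, k, l, seed=0):
        rng = random.Random(seed)
        # One random coordinate subset I per hash function (0-based indices).
        self.subsets = [sorted(rng.sample(range(dim), k)) for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, i, p):
        return tuple(p[j] for j in self.subsets[i])

    def add(self, p):
        # Training: store p in bucket h_i(p) of the i-th table.
        for i, table in enumerate(self.tables):
            table[self._key(i, p)].append(p)

    def query(self, q):
        # Query: gather the buckets h_1(q), ..., h_l(q), then pick the closest candidate.
        candidates = set()
        for i, table in enumerate(self.tables):
            candidates.update(table.get(self._key(i, q), ()))
        if not candidates:
            return None  # no stored point shares a bucket with q
        return min(candidates, key=lambda p: hamming(p, q))


# Hypothetical data: 50 random 16-bit points
rng = random.Random(1)
points = [tuple(rng.randint(0, 1) for _ in range(16)) for _ in range(50)]
index = LSHIndex(dim=16, k=4, l=5)
for p in points:
    index.add(p)
print(index.query(points[0]) == points[0])
```

Only the candidates pulled from the l buckets are compared against q, so the cost depends on bucket sizes rather than on the size of P. A query can also come back empty.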

Page 14:

Indyk’s Results

Compared with a tree-based algorithm
Color histogram dataset from Corel Draw
– 20,000 images, 64 dimensions
– Used 1k, 2k, 5k, 10k, 19k points for training
– 1k points are used for queries
– Computed the miss ratio: the fraction of queries with no hits
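The miss ratio is simple to compute once the query results are collected. The slide does not spell out what counts as a hit; this sketch assumes a miss is a query for which nothing was retrieved, represented here as `None`. The function name and the sample results are hypothetical.

```python
def miss_ratio(results):
    """Fraction of queries with no hit; `results` holds one entry per query, None for a miss."""
    misses = sum(1 for r in results if r is None)
    return misses / len(results)


# Hypothetical results of four queries, two of which retrieved nothing
results = [(0, 1, 1), None, (1, 1, 0), None]
print(miss_ratio(results))
```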

Page 15:

Indyk’s Results

Page 16:

Results II

Page 17:

Ugly Side

Works best with Hamming distance
– Can be extended to L1 and L2 norms
Requires parameter tweaking (size of I and number of hash buckets)
Does not work well on uniform data

Page 18:

Bibliography

A. Gionis, P. Indyk, R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. 25th VLDB, 1999.

J. Buhler. Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics 17(5), 419-428, 2001.

H. Murase, S. K. Nayar. Visual Learning and Recognition of 3D Objects from Appearance. IJCV 14(1), 5-24, 1995.

A. Pentland, R. W. Picard, S. Sclaroff. Photobook: Tools for Content-Based Manipulation of Image Databases. SPIE Vol. 2185, 34-47, 1994.

M. J. Swain, D. H. Ballard. Color Indexing. IJCV 7(1), 11-32, 1991.

G. Mori, S. Belongie, J. Malik. Shape Contexts Enable Efficient Retrieval of Similar Shapes. CVPR, Vol. 1, 723-730, 2001.

Slides: “Algorithms for Nearest Neighbor Search” by Piotr Indyk

Slides: “Approximate Nearest Neighbor in High Dimensions via Hashing” by Aris Gionis, Piotr Indyk, and Rajeev Motwani

