Locality-Preserving Hashing in Multidimensional Spaces

Piotr Indyk* Rajeev Motwani† Prabhakar Raghavan

Department of Computer Science Department of Computer Science IBM Almaden Research Center

Stanford University Stanford University [email protected]

[email protected] [email protected]

Santosh Vempala‡

School of Computer Science

Carnegie Mellon University

[email protected]

Abstract

We consider locality-preserving hashing, in which adjacent points in the domain are mapped to adjacent or nearly-adjacent points in the range, when the domain is a d-dimensional cube. This problem has applications to high-dimensional search and multimedia indexing. We show that simple and natural classes of hash functions are provably good for this problem. We complement this with lower bounds suggesting that our results are essentially the best possible.

1 Introduction

In a recent paper, Linial and Sasson [21] proved the following theorem about hash functions:

Theorem 1 There exists a family G of functions from an integer line [1, ..., U] to [1, ..., R] and a constant C such that for any S ⊂ [1, ..., U] with |S| ≤ CR:

● all f ∈ G are non-expansive, i.e., for any p, q ∈ U, d(f(p), f(q)) ≤ d(p, q).

The family G contains O(|U|) functions, each of which is computable in O(1) operations.

Their result gives a family of hash functions with the surprising property that points close to each other in the domain are hashed to points close to each other in the range. A potential application of their result is to find near neighbors

*Supported by NSF Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

†Supported by an Alfred P. Sloan Research Fellowship, an IBM Faculty Partnership Award, an ARO MURI Grant DAAH04-96-1-0007, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

‡Supported in part by NSF National Young Investigator grant CCR-9357793.

to points in one-dimensional space. Given a set of points on the line, one can hash these points so that we can search for the points closest to a query point as follows: we hash the query q to h(q), then search the neighborhood of h(q) in the hash table to retrieve, say, the nearest k points within some distance δ in the domain in time min{k, δ}. The advantage of the locality-preserving property is that it affords

good paging performance: since the neighborhood of q (in the domain) is not scattered all over the range, the neighborhood of h(q) in the hash table exhibits good locality of reference. This is counter to Knuth's suggestion (see [18], p. 540) that

In a virtual memory environment we probably ought to use tree search or digital tree search, instead of creating a large scatter table that requires bringing a new page nearly every time we hash a key.
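The one-dimensional scheme just described can be sketched in a few lines of Python. The folding hash below reflects the line accordion-style at a random phase; it is a simplified illustration of a non-expansive folding hash, not the actual Linial-Sasson construction, and all names (`make_folding_hash`, `near_neighbors`, etc.) are ours.

```python
import random

def make_folding_hash(U, R, seed=None):
    """Illustrative 1-D folding hash from {0, ..., U} onto {0, ..., R}.

    The line is folded accordion-style at multiples of R after a random
    shift, like folding a paper strip.  The resulting map is
    non-expansive: |h(p) - h(q)| <= |p - q| for all p, q.  (This is a
    simplified stand-in for the Linial-Sasson family, whose turning
    points are chosen more carefully.)
    """
    rng = random.Random(seed)
    shift = rng.randint(0, 2 * R - 1)  # random phase of the fold

    def h(p):
        q = (p + shift) % (2 * R)      # position within one up-down period
        return q if q <= R else 2 * R - q
    return h

def build_table(points, h):
    """Hash every point into a bucket keyed by its hash value."""
    table = {}
    for p in points:
        table.setdefault(h(p), []).append(p)
    return table

def near_neighbors(q, delta, h, table):
    """All stored points within distance delta of q.  Because h is
    non-expansive, |p - q| <= delta implies |h(p) - h(q)| <= delta, so
    scanning the buckets h(q) - delta .. h(q) + delta cannot miss any
    true neighbor."""
    found = []
    for b in range(h(q) - delta, h(q) + delta + 1):
        for p in table.get(b, []):
            if abs(p - q) <= delta:
                found.append(p)
    return found
```

Since nearby domain points land in nearby buckets, the scan touches a contiguous region of the table, which is exactly the paging-friendly behavior discussed above.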

Of course, there are many other good ways of retrieving near neighbors in one dimension. However, efficient near-neighbor retrieval is considerably harder, and of growing importance, in higher dimensions. The main application comes from information retrieval: the process of retrieving text and multimedia documents matching a specified query. Other instances of near-neighbor search appear in algorithms for pattern recognition [6, 10], statistics and data analysis [26, 8], machine learning [5], data compression [12], data mining [13] and image analysis [20].

In the case of text retrieval, vector-space methods [3, 28] map each document into a point in high-dimensional space. Sometimes, statistical techniques such as principal components analysis [14], latent semantic indexing [7] or the Karhunen-Loève/Hotelling transform [16, 22] are used to reduce the dimensionality of the vector space in which the documents are represented, but the number of dimensions could still be very large (say, even 200).

In image and multimedia retrieval, a common first step is to extract a set of numerically-valued features or parameters from the document. For instance, IBM's Query-by-Image-Content [11] and MIT's Photobook [27] extract image features such as color histograms (hues, intensities), shape descriptors, as well as quantities measuring texture. Once these features have been extracted, each image in the database may now be thought of as a point in this multidimensional feature space (one of the coordinates might, for the sake of a simplistic example, correspond to the overall intensity of red pixels, and so on). Dimension-reduction techniques can again be applied, in some cases getting down to under 25 dimensions.


Both for text and for images (as well as for other multimedia documents), a user-defined query is transformed to the same vector space where the documents are represented. The retrieval question now turns into one of finding near neighbors in this space. Most typically, the query one seeks to answer is of the form "give me the k nearest neighbors, but don't give me anything that's more than distance δ away." In practice k is usually a small number (such as 8), and the bound δ depends on the representation: the intent is to get (say) the nearest 8 images as long as there are at least 8 images "close by." (Users are typically not interested in seeing 8 images if many of them do not match the query very well.) Most current retrieval systems resort to brute-force linear search, computing the distance from the query to all the points in the database. We know of no system embodying high-dimensional search that has sublinear worst-case query performance.

Generalizing the approach of Linial and Sasson to higher dimensions offers promise as a way of tackling this problem. In this paper we give such a generalized construction of locality-preserving hash functions in higher dimensions, together with negative results suggesting that our construction is theoretically the best possible. In addition to the theoretical value of these results, our construction will in principle afford fast retrieval and good paging behavior during search. Realistically, though, it will in practice only work for modest values of the dimension d (say, 10-20). Nevertheless it offers a simple approach for indexing problems (if the feature space after dimensionality reduction has moderate dimension) and in iterative computations for sparse finite-element relaxation methods. To put this in perspective, we survey some other approaches to near-neighbor finding.

Approaches for Near-Neighbor Search. Samet [29] surveys a variety of data structures used for this problem, including variants of k-d trees, R-trees, and structures based on space-filling curves. While some of these structures perform reasonably well in 2 or 3 dimensions (and even admit probabilistic analyses for very simple probability distributions from which the points in the database are drawn), they all exhibit behavior that is poor in the worst case (and usually in typical cases as well).

The field of computational geometry has developed a rich theory for the study of proximity problems. Establishing upper bounds on the time required to answer a nearest-neighbor query in R^d appears to have been first undertaken by Dobkin and Lipton [9]; they provided an algorithm with query time O(2^d log n) and pre-processing O(n^{2^d}). This was improved by Clarkson [4]: he gave an algorithm with query time O(exp(d) log n) and pre-processing O(n^{⌈d/2⌉(1+ε)}); here exp(d) denotes a function that grows at least as quickly as 2^d. The query time was later improved by Meiser [24] to O(d^5 log n) with pre-processing O(n^{d+ε}). Recently, Kleinberg [17] has developed a scheme for the approximate nearest neighbor problem that achieves query time O(d^2 log n) with preprocessing n^{O(d)}. There have been a number of other approaches and extensions (e.g., [31, 23, 25, 1, 2]). The best approaches from these studies are still impractical for the values of d encountered in the retrieval applications above.

Overview of Paper. In Section 3 we give a construction for locality-preserving hash functions in two dimensions; this is extended to higher dimensions in Section 4. For d dimensions, our functions are √d-expansive under the l2 norm, d-expansive under the l1 norm, and non-expansive under the l∞ norm. The constants in our guarantees (bucket size, collision probability) grow with the dimension, roughly as O(c^d), where c is a fixed constant, for the l2 norm and O(d^d) for l1 and l∞. In Section 5 we present several negative results exploring the intrinsic limitations of locality-preserving hash functions. First, we establish a lower bound of Ω(1/R) on the collision probability for any family of non-expansive functions in d ≥ 2 dimensions; this implies that obtaining low collision probability is essentially impossible for higher dimensions. Then we restrict ourselves to polynomial-sized families of a natural subclass of non-expansive hash functions. We show that even if we allow up to c elements to be stored in each bucket, no such family is able to hash more than roughly O(R^{d(1-1/c)}) elements. Both results suggest that relaxing the non-expansiveness constraint is essential in order to obtain good bounds. We conclude in Section 6 with a number of issues which merit further study.

2 Preliminaries

The domain of the hash functions we consider in this paper is a d-dimensional cube D = {-U, ..., U}^d. The range of the functions will be a set of points in a d-dimensional cube I of side R. In this section we present our results for the case d = 2; in Section 4 we outline the generalization to higher dimensions. Although we consider only cubes (i.e., with equal-length sides) for the domain, the results can be extended to cuboids with unequal sides. Let d(p, q) be the distance between any points p and q. We will use the notation that p = (x_p, y_p) and q = (x_q, y_q). We define

● dx(p, q) = |x_q - x_p| and dy(p, q) = |y_q - y_p|,

● dr(p, q) = (dx(p, q)^r + dy(p, q)^r)^{1/r}, for any r ≥ 1.

These definitions extend naturally to d > 2 dimensions.

Definition 1 A function h : D → I is c-expansive under a distance metric d if for any p, q ∈ D, d(h(p), h(q)) ≤ d(p, q) + c. If c = 0, then h is said to be non-expansive under d.
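Definition 1 can be checked empirically on sample pairs. The small helper below is our own (not from the paper); it reports the largest observed expansion of a map, which is at most c for a c-expansive function and at most 0 for a non-expansive one.

```python
def max_expansion(h, pairs, dist):
    """Largest observed value of d(h(p), h(q)) - d(p, q) over the given
    sample pairs.  For a c-expansive h this never exceeds c; for a
    non-expansive h it is at most 0."""
    return max(dist(h(p), h(q)) - dist(p, q) for p, q in pairs)
```

For example, the identity map has expansion 0 on any sample, while p → 2p expands pairs by their original gap.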

In our constructions we use the one-dimensional family of hash functions G introduced by Linial and Sasson [21]. Their functions are obtained by "folding" the domain D along randomly-chosen turning points in such a way that any segment of D between two consecutive turning points has length Θ(R) (see [21] for a formal definition). Besides the properties described in the introduction, G has the property that Pr_{g∈G}(g(x) = g(y)) = O(1/R) for any x ≠ y ∈ D.

In the sequel, we often omit the subscript of Pr_{f∈F}, which denotes the probability space from which f is chosen; it should be assumed that f is chosen uniformly at random from the probability space implicit from the context.

3 Hashing in Two Dimensions

The main theorem we prove in this section asserts that we can achieve O(1) bucket size using locality-preserving hash functions in the 2-dimensional setting. To this end, we describe a family of 1-expansive hash functions H2 ⊂ {h : D → I} based on applying random rotations to the 2-dimensional domain cube D.

Formally, we define H2 to be all functions h of the form h(p) = t2(g(r(p))), where for p = (x_p, y_p),

● t2(p) = (⌊x_p/c⌋, ⌊y_p/c⌋);


● g(p) = (g_x(x_p), g_y(y_p)), where g_x, g_y ∈ G;

● To define r(p) we need some extra notation. Let p̄ be the column vector whose coordinates are the coordinates of p. Let q̄ = (x'_q, y'_q) = M p̄, where M is a 2 × 2 rotation matrix with column sums equal to 1. Then r(p) = (⌊x'_q⌋, ⌊y'_q⌋).

The function r rotates the domain cube, and then rounds off the rotated points to the nearest lattice points. It will suffice to restrict the entries of M to the discrete set of multiples of 1/R in the range [0, 1]. Hence, the size of the family H2 is equal to |G|R^2.
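The three-stage construction h = t2 ∘ g ∘ r can be sketched in Python as follows. The matrix has column sums 1 with entries that are multiples of 1/R, as in the text; the per-coordinate fold is a simplified stand-in for the Linial-Sasson functions g_x, g_y, the constant c of t2 is left as a parameter, and all names are ours.

```python
import math
import random

def make_h2(R, c=2, seed=None):
    """Sketch of one member of H2: h(p) = t2(g(r(p))).

    r multiplies p by a random 2x2 matrix whose columns sum to 1
    (entries are multiples of 1/R) and rounds down; g folds each
    coordinate with a non-expansive triangle-wave fold (a simplified
    stand-in for the Linial-Sasson functions); t2 shrinks each
    coordinate by the constant c.
    """
    rng = random.Random(seed)
    u1 = rng.randint(0, R) / R          # column sums: u1 + u2 = v1 + v2 = 1
    v1 = rng.randint(0, R) / R
    u2, v2 = 1 - u1, 1 - v1
    sx = rng.randint(0, 2 * R - 1)      # random fold phases for g_x, g_y
    sy = rng.randint(0, 2 * R - 1)

    def fold(x, shift):                 # non-expansive fold onto {0, ..., R}
        q = (x + shift) % (2 * R)
        return q if q <= R else 2 * R - q

    def h(p):
        x, y = p
        rx = math.floor(u1 * x + v1 * y)     # r: "rotate" and round down
        ry = math.floor(u2 * x + v2 * y)
        gx, gy = fold(rx, sx), fold(ry, sy)  # g: fold each coordinate
        return (gx // c, gy // c)            # t2: truncate by c
    return h
```

A quick sanity argument for the d1 expansion of this sketch: the column-sum-1 matrix does not expand d1 distances, rounding adds at most 1 per coordinate, the fold adds nothing, and the final truncation roughly divides distances by c, so the overall d1 expansion stays bounded by a small constant.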

Theorem 2 There exist constants C and B such that for any domain size U and range size R, the functions h ∈ H2 are 1-expansive under the d1 distance metric and have the property that for any p ∈ I and any S ⊂ D with |S| ≤ CR,

Pr(|h^{-1}(p) ∩ S| ≤ B) ≥ 1/2.

The proof of Theorem 2 is obtained using Lemma 1, Lemma 2 and Lemma 4 (proved below).

Lemma 1 Each h ∈ H2 is 1-expansive.

Proof: It is easy to verify that each function r is 2-expansive, hence so is g ∘ r. Application of t2 reduces the expansion to 1. □

To bound the number of collisions, we look at two separate cases: pairs p, q which are close to each other (i.e., d∞(p, q) < R) and those that are far from each other (i.e., d∞(p, q) ≥ R). The following fact takes care of the first case.

Lemma 2 There exists a constant B1 such that for each p ∈ D and any h ∈ H2, the number of q ∈ D such that d∞(p, q) < R and h(p) = h(q) is less than B1.

Proof: Follows from the construction of g, and the observation that for any p ∈ D, the number of q ∈ D such that r(p) = r(q) is bounded by a constant. □

Lemma 3 Let p, q ∈ D be such that d∞(p, q) ≥ R. Then there is a constant A such that r(p), r(q) differ in both coordinates with probability at least 1 - A/R.

Proof: Let

M = ( u1  v1
      u2  v2 )

such that u1 + u2 = v1 + v2 = 1. Let p, q be such that d∞(p, q) = |x_p - x_q| ≥ R. Observe that

r(p)|_x - r(q)|_x = (Mp)|_x - (Mq)|_x = u1(x_p - x_q) + v1(y_p - y_q),

r(p)|_y - r(q)|_y = (Mp)|_y - (Mq)|_y = u2(x_p - x_q) + v2(y_p - y_q).

Then there is some constant A such that with probability at least (1 - A/R), both expressions above are at least 2 in magnitude. After the rounding step, the points will remain different in each coordinate. We conclude that with probability at least 1 - A/R, both coordinates of r(p) and r(q) will be different. □

In order to prove the next lemma, we need the following fact which follows from the analysis due to Linial and Sasson [21].

Fact 1 There is a constant C1 such that for any p, q ∈ D, if p and q differ on k coordinates, then

Pr(t2(g(p)) = t2(g(q))) ≤ (C1/R)^k.

Lemma 4 There exists a constant C2 such that for any p, q ∈ D with d∞(p, q) ≥ R,

Pr(h(p) = h(q)) ≤ C2/R^2.

Proof: We proceed as follows:

Pr(h(p) = h(q)) ≤ Pr(t2(g(r(p))) = t2(g(r(q))))

≤ (A/R) · (C1/R) + Pr(t2(g(r(p))) = t2(g(r(q))) | r(p), r(q) differ in 2 coordinates)

≤ (A/R) · (C1/R) + (C1/R)^2 ≤ C2/R^2. □

4 Hashing in Higher Dimensions

In this section we extend the locality-preserving hash families to any dimension d. To this end we use a more conventional version of rotation matrices, i.e., one where the columns of the rotation matrix M are required to be orthonormal. This preserves the d2 distance between points, and allows us to obtain results for locality-preserving hashing under the d2 distance metric. The results for other metrics then easily follow from the fact that any metric dr for r ≥ 1 can be approximated by d2 with only a √d-factor distortion. The constants in our guarantees (bucket size, collision probability) grow with the dimension, roughly as O(c^d) (where c is a fixed constant) for a hash family that preserves d2 distance and as O(d^d) for hash families that preserve other distances.

We define Hd as the collection of all functions h of the form h(p) = g(r(p)), where for p = (x1, x2, ..., xd),

● g(p) = (g_{x1}(x1), ..., g_{xd}(xd)), where g_{x1}, ..., g_{xd} ∈ G;

● We define q̄ = (x'1, ..., x'd) = M p̄, where M is a d × d rotation matrix with orthonormal columns. Then r(p) = (⌊x'1⌋, ..., ⌊x'd⌋).

Theorem 3 There exist constants C and B such that for any dimension d, domain size U, and range size R, the functions h ∈ Hd are √d-expansive under the d2 distance metric and have the property that for any p ∈ I and any S ⊂ D with |S| ≤ CR^{d/2},

Pr(|h^{-1}(p) ∩ S| ≤ B^d) ≥ 1/2.

First, we establish the √d-expansiveness.

Lemma 5 Each h ∈ Hd is √d-expansive.

Proof: Let h(p) = g(⌊M(p)⌋). As g and M are non-expansive, it suffices to note that for any p = (p1, ..., pd) and q = (q1, ..., qd) from R^d, we have ||p_i - q_i| - |⌊p_i⌋ - ⌊q_i⌋|| ≤ 1 for i = 1, ..., d, and hence rounding can change each coordinate of a difference vector between two points by at most 1. The result then follows from the triangle inequality. □


We now concentrate on bounding the bucket size. To this end, the following fact will be useful.

Fact 2 Let Bd(r) be the d-dimensional ball of radius r centered at the origin, and let Sd(r) be the bounding sphere of Bd(r). Then, letting Γ(x) denote the gamma function,

● the volume |Bd(r)| of Bd(r) is equal to 2 r^d π^{d/2} / (d Γ(d/2)),

● the area |Sd(r)| of Sd(r) is 2 π^{d/2} r^{d-1} / Γ(d/2).
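The two formulas in Fact 2 are easy to sanity-check numerically with the standard gamma function; the helper names below are ours.

```python
import math

def ball_volume(d, r):
    """|B_d(r)| = 2 r^d pi^(d/2) / (d Gamma(d/2))  (Fact 2)."""
    return 2 * r**d * math.pi ** (d / 2) / (d * math.gamma(d / 2))

def sphere_area(d, r):
    """|S_d(r)| = 2 pi^(d/2) r^(d-1) / Gamma(d/2)  (Fact 2)."""
    return 2 * math.pi ** (d / 2) * r ** (d - 1) / math.gamma(d / 2)
```

For d = 2 and d = 3 these reduce to the familiar πr², 2πr, (4/3)πr³ and 4πr²; note also that |Sd(r)| = d·|Bd(r)|/r, i.e., the area is the derivative of the volume with respect to r.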

The proof again proceeds in two stages. First, we bound the number of collisions caused by elements that are close to each other. Then, we bound the probability that two elements that are far apart are mapped to the same bucket.

Lemma 6 There exists a constant B1 such that for any p ∈ D and any h ∈ Hd, there are at most B1^d choices of q ∈ D such that d2(p, q) ≤ √d R and h(p) = h(q).

Proof: The proof consists of two parts. First, we bound the number of points mapped to the same location by r. Then we bound the number of q's as above mapped to the same location as p by g.

For the first part, we need to bound the number of points from the d-dimensional unit lattice contained in any (possibly rotated) unit cuboid C. It suffices to bound the number of lattice cuboids intersecting C, as each such cuboid contributes at most 2^d points. To this end, we bound the number of lattice cuboids intersecting a ball (of radius √d/2) circumscribed around C, which again can be bounded by the number of points contained in a ball of radius 2√d around C. As the volume of each cuboid is 1, it is sufficient to estimate the volume of a d-dimensional ball of radius 2√d, which is

2 (2√d)^d π^{d/2} / (d Γ(d/2)) ≤ π^{d/2} (2√d)^d / (d/2 - 2)! ≤ c^d

for a fixed constant c. For the second part, notice that each g splits the domain into disjoint cuboids of side at least Θ(R) such that no two points from the same cuboid overlap. Hence it is sufficient to bound the number of such cuboids intersecting a d-dimensional ball of radius √d R. This can be done by an argument similar to the one above. □

Lemma 7 Let p, q ∈ D be such that d2(p, q) ≥ √d R. Then there exists a constant A such that

Pr(r(p), r(q) are equal in k coordinates) ≤ (d choose k) √d (A/R)^k.

Proof: Let l = d2(p, q) and z = q - p. It is sufficient to prove that for any K ⊆ {1, ..., d}, the probability that r(p) and r(q) are equal on K is at most √d (A/R)^k, where k = |K|. Without loss of generality, we assume K = {1, ..., k}. To complete the proof of this lemma, we will employ the following two facts.

Fact 3 Let o, p, q be any points in R^d. Furthermore, let M_o be a matrix of a random rotation with origin at o, and similarly let M_q be a matrix of a random rotation with origin at q. Then the distributions of M_o(q - p) and M_q(q - p) are identical.

Fact 4 Let M be a random orthonormal matrix. Then for any unit vector x ∈ R^d, Mx is a random unit vector.

Proof: Let M = (u1, ..., ud) be chosen thus: the first column, u1, is a random unit vector, i.e., a vector chosen at random from the unit sphere. The ith column, u_i, is chosen at random from the set of unit vectors orthogonal to the first i - 1 columns.

Consider x = (1, 0, ..., 0) ∈ R^d. Given the manner in which we selected the first column of M, the vector Mx is a random unit vector.

Now consider an arbitrary unit vector y. There exists an orthonormal matrix S such that Sy = x; in other words, y = S^T x. We wish to show that My is a random unit vector. But My = (MS^T)x. The orthonormal matrix S^T, viewed as a function from the unit sphere to the unit sphere, is 1-1 and measure-preserving. (Note that for a discrete version of the lemma, e.g., if all coordinates are multiples of 1/R, the 1-1 property suffices.) Hence the matrix MS^T is a random orthonormal matrix, just like M. It follows that (MS^T)x is uniformly distributed on the unit sphere, implying the desired result. □
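The column-by-column sampling in the proof above can be carried out with Gram-Schmidt on independent Gaussian vectors, which produces each column uniformly among unit vectors orthogonal to the previous ones. A plain-Python sketch (the helper names are ours):

```python
import math
import random

def random_orthonormal(d, rng):
    """Sample a random d x d orthonormal matrix, column by column, as in
    the proof of Fact 4: each new column is a uniform unit vector in the
    orthogonal complement of the earlier ones.  Gram-Schmidt applied to
    i.i.d. Gaussian vectors achieves exactly this distribution."""
    cols = []
    while len(cols) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in cols:  # subtract projections onto earlier columns
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # a zero remainder occurs with probability 0
            cols.append([a / norm for a in v])
    return cols  # list of orthonormal columns

def apply_matrix(cols, x):
    """Compute Mx where the matrix M has the given columns."""
    d = len(x)
    return [sum(cols[j][i] * x[j] for j in range(d)) for i in range(d)]
```

One can check that the columns come out orthonormal and that M maps unit vectors to unit vectors, the property the proof relies on.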

Now we proceed as follows. From Fact 3 we may assume that the origin of rotation is located at q. By Fact 4 it is sufficient to consider only z = (1, 0, ..., 0). Hence Mz = l·u, where u = (u1, ..., ud) is the first column of M. It is now sufficient to show that

Pr(|(Mz)_i| ≤ 1 for all i ∈ K) = Pr(l·|u_i| ≤ 1 for all i ∈ K) ≤ √d (A/R)^k.

Let C_i be the set R^{i-1} × [-1, 1] × R^{d-i} and let φ = sin^{-1}(1/l). The probability Pr(l·|u_i| ≤ 1 for i ∈ K) is equal to the area of the surface of

P_K(l) = Bd(l) ∩ ∩_{i∈K} C_i

divided by |Sd(l)|. Hence |P_K(l)| ≤ |S_{d-k}(l)| · 2^k, and

|P_K(l)| / |Sd(l)| ≤ 2^k d^{⌈k/2⌉} / (π^{k/2} l^k) ≤ √d (A/R)^k,

since l ≥ √d R. □

Lemma 8 There exists a constant C2 such that for any p, q ∈ D with d2(p, q) ≥ √d R,

Pr(h(p) = h(q)) ≤ (C2/R)^d.


Proof: We recall Fact 1 and proceed as follows:

Pr(h(p) = h(q)) ≤ Pr(t2(g(r(p))) = t2(g(r(q))))

≤ Σ_{k=0}^{d} Pr(t2(g(r(p))) = t2(g(r(q))) | r(p), r(q) are equal in k coordinates) · Pr(r(p), r(q) are equal in k coordinates)

≤ Σ_{k=0}^{d} √d (d choose k) (A/R)^k (C1/R)^{d-k} ≤ (C2/R)^d. □

For any distance metric dr, where r ≠ 2, we have the following.

Theorem 4 There exist constants C and B such that for any dimension d, domain size U, and range size R, there exists a family Hd' ⊂ {h : D → I} of d^{1/r}-expansive functions (under the dr distance metric) such that for any p ∈ I and any S ⊂ D with |S| ≤ CR^{d/2},

Pr(|h^{-1}(p) ∩ S| ≤ (Bd)^d) ≥ 1/2.

Proof: Take the family as in Theorem 3 and apply td to each coordinate. The proof follows from Theorem 3 and the fact that for any points p and q, and any choice of r, the distance metrics d2(p, q) and dr(p, q) differ by at most a factor of d. □

Remark 1 By a more detailed analysis we can establish that functions similar to the ones defined above are in fact (d^{1/r} - 1)-expansive. This specifically implies the existence of hash functions which are non-expansive in d∞.

Remark 2 It may seem natural to try to improve the O(d^d) bound for d1 and d∞ by applying a method similar to the one in the proof of Theorem 3. However, this method requires a "random rotation," and it is not clear if a similar transformation exists for norms other than l2. In fact, the notion of rotation in lp for p ≠ 2 is not well-defined. The reason is that a rotation is defined as a linear transformation described by a matrix with orthonormal columns. However, it is known that for R^d with the lp norm, where p ≠ 2, no inner product exists (cf. [19], p. 133), and so orthonormality in such spaces is not well-defined.

5 Negative Results

Can small bucket size be achieved with non-expansive hash functions? In this section we prove lower bounds that indicate the contrary. First we focus on lower bounding the collision probability for non-expansive functions. Then we show that in fact small bucket size is not possible for a subclass of non-expansive functions which we call multifoldings. It is an open problem as to whether any non-expansive function can be viewed as a multifolding. We conjecture that this is indeed the case.

5.1 Lower Bounds on Collision Probability

Theorem 5 For any family H of non-expansive functions under the d1 distance metric, there exists a pair x, y ∈ D such that Pr(f(x) = f(y)) = Ω(1/R).

Proof: We prove the theorem for d = 2 (d > 2 can be handled similarly). Let h ∈ H. We define the c-crease set of h as

C_c(h) = {p ∈ D | there exists q ≠ p such that h(p) = h(q) and d1(p, q) ≤ c}.

The following lemma holds.

Lemma 9 There exists a constant C such that for any h : D → I, |C4(h)| ≥ C|D|/R.

Proof: Divide D into disjoint slices, i.e., rectangles of length R + 2 and height 3. We show that each slice contains at least one point from C4(h). Without loss of generality, we examine only the set [1, ..., R + 2] × [1, ..., 3]. Let L = {(i, 2) | i = 2, ..., R + 2} and consider the set h(L). If h((2, 2)) = h((3, 2)), there is nothing to prove. In the opposite case, either dx(h((2, 2)), h((3, 2))) = 1 or dy(h((2, 2)), h((3, 2))) = 1. Assume the first case (the second is symmetric). Then

● either h(L) forms a line, i.e., dy(h(p), h(q)) = 0 for every p, q ∈ L; or,

● there exists i ≥ 3 such that dy(h((i, 2)), h((i + 1, 2))) = 1.

In the first case, clearly for every p, q ∈ L, d1(h(p), h(q)) ≤ R (as R is the width of I), so |h(L)| ≤ R. As |L| = R + 1, there are two points p ≠ q such that h(p) = h(q) (assume x_p < x_q). Consider the sequence S = h((x_p, 2))|_x, ..., h((x_q, 2))|_x. Its first and last elements are equal and (due to the non-expansive property of h) consecutive elements of S differ by at most 1. It follows that there exist two points p', q' ∈ L such that d1(p', q') ≤ 2 and h(p') = h(q').

For the second case, assume that i is the smallest index such that dy(h((i, 2)), h((i + 1, 2))) = 1. Then either dx(h((i - 1, 2)), h((i, 2))) = 0 (and there is nothing to prove) or dx(h((i - 1, 2)), h((i, 2))) = 1. In the latter case, notice that there is a set T of 6 points in D different from T' = {(i - 1, 2), (i, 2), (i + 1, 2)} which are within distance 1 of some point from T'' = {(i - 1, 2), (i + 1, 2)}, while there are only 5 points in I different from h(T') within distance 1 of some point from h(T''). Hence at least two points from T ∪ T' are mapped to the same location. As d1(p, q) ≤ 4 for any p, q ∈ T ∪ T', the lemma follows. □

By a simple counting argument it follows from Lemma 9 that there exists p ∈ D such that Pr(p ∈ C4(h)) ≥ C/R. Consider the (at most) 24 points q ≠ p such that d1(p, q) ≤ 4. Clearly, for at least one of them Pr(h(p) = h(q)) ≥ C/(24R). □

Theorem 6 For any family H of non-expansive functions under the d∞ distance metric, there exists a pair x, y ∈ D such that Pr(f(x) = f(y)) = Ω(1/R).

Proof: It is sufficient to prove Lemma 9 for the d∞ metric. To this end, define L = {(i, 2) | i = 2, ..., R + 2} as before and consider the following cases:

Case 1: for all p, q ∈ L, dx(h(p), h(q)) = 0, or for all p, q ∈ L, dy(h(p), h(q)) = 0; we proceed as in the proof of Lemma 9.


Case 2: dx(h((2, 2)), h((3, 2))) = dy(h((2, 2)), h((3, 2))) = 1. Then there exists a set T of 4 points from D different from T' = {(2, 2), (3, 2)} within distance 1 to both points from T'. However, there are at most 2 points in I different from h(T') within distance 1 to both points in h(T'), so at least two points from T' ∪ T collide.

Case 3: dx(h((2, 2)), h((3, 2))) = 0 and dy(h((2, 2)), h((3, 2))) = 0; we are done.

Case 4: dx(h((2, 2)), h((3, 2))) = 1 and dy(h((2, 2)), h((3, 2))) = 0. Let i be the smallest index such that dy(h((i, 2)), h((i + 1, 2))) = 1 and let (Δx, Δy) = h((i + 1, 2)) - h((i, 2)). If |Δx| = 1, we proceed as in Case 1. If |Δx| = 0, there exists a set T of 4 points from D different from T' = {(i, 2), (i + 1, 2)} within distance 1 to both points from T'. However, there are at most 3 points in I different from h(T') within distance 1 to both points in h(T'), so at least two points from T' ∪ T collide.

Case 5: d_x(h(2, 2), h(3, 2)) = 0 and d_y(h(2, 2), h(3, 2)) = 1; symmetric to Case 4. ❑

Theorem 7 There exists a set S of size 3(R + 2) such that no non-expansive (under any distance metric) function h : D → I is one-to-one on S.

Proof: Follows from the proof of Lemma 9. ❑

5.2 Lower Bounds on Multifolding

We need some further notation. With any s = (s_1, ..., s_d) ∈ ℝ^d and t ∈ ℝ we associate a hyperplane

L_(s,t) = {q ∈ ℝ^d | s · q = t}.

In the sequel, we use both (s, t) and L_(s,t) to denote a line; when convenient, we will think of the line as lying in D instead of ℝ^d. We refer to s as the slope of L_(s,t). For any line L_(s,t) we also define its positive side

L^+_(s,t) = {q ∈ ℝ^d | s · q ≥ t}

and negative side

L^-_(s,t) = {q ∈ ℝ^d | s · q < t}.

A real folding along line L is a function r_L : ℝ^d → ℝ^d such that r_L(p) = p if p ∈ L^+, and r_L(p) is the reflection of p about L otherwise.

A folding along L is the function f_L(p) = (⌊[r_L(p)]_x⌋, ⌊[r_L(p)]_y⌋).

A folding along any line crossing a point in Z^d with a slope from the set B = {-1, 0, 1}^d - {0} is called simple. A multifolding is any composition of foldings and translations by vectors in Z^d. We define a real multifolding analogously. A simple multifolding is any composition of simple foldings and translations.
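For concreteness, the definitions above can be sketched in code for d = 2. This is an illustrative reconstruction, not code from the paper; the names fold and multifold, and the use of floor to snap the reflected point back to the grid, are our assumptions based on the definition of f_L:

```python
import math

def fold(p, s, t):
    """Folding f_L along the line L = {q : s.q = t} in d = 2.

    Points on the positive side (s.q >= t) are fixed; points on the
    negative side are reflected about L, and the image is returned to
    the integer grid by taking floors of both coordinates.
    """
    sx, sy = s
    px, py = p
    dot = sx * px + sy * py
    if dot >= t:                      # p lies in L^+: fixed point
        return (px, py)
    # Reflection of p about L: p' = p - 2*((s.p - t)/(s.s)) * s
    scale = 2.0 * (dot - t) / (sx * sx + sy * sy)
    return (math.floor(px - scale * sx), math.floor(py - scale * sy))

def multifold(p, steps):
    """A multifolding: a composition of foldings and translations by
    vectors in Z^2. Each step is ('fold', s, t) or ('shift', v)."""
    for kind, *args in steps:
        if kind == 'fold':
            s, t = args
            p = fold(p, s, t)
        else:
            (vx, vy), = args
            p = (p[0] + vx, p[1] + vy)
    return p
```

A folding is simple when the slope s belongs to B = {-1, 0, 1}^2 - {(0, 0)}; e.g., fold(p, (1, 0), 2) folds the plane about the vertical line x = 2, mapping (0, 0) to (4, 0).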

The following theorem says that hash function families consisting of simple multifoldings have large bucket sizes even when hashing small sets.

Theorem 8 For any c and d there exists c_1 such that for any R, and any family H of simple multifoldings h : D → I, there exists a set S of c_1 R^{1-1/c} log |H| points such that for any h ∈ H there exists q ∈ I with |h^{-1}(q) ∩ S| ≥ c.

Proof: We prove the theorem for d = 2 (d > 2 can be handled similarly). We assume U = kR for some constant k dependent on c. The following lemma will be helpful.

Lemma 10 Let h : X → Y be a function such that for at least a fraction a_1 of x ∈ X we have |h^{-1}(h(x))| ≥ c, for some constants a_1 and c. Then there exists c_1 such that for any ε > 0, for a random S ⊂ X of cardinality c_1 |X|^{1-1/c} log(1/ε),

Pr(for some y ∈ Y, |h^{-1}(y) ∩ S| ≥ c) ≥ 1 - ε.

Proof: Let m = |X|. Let A_1, ..., A_l be a family of disjoint c-subsets of X such that |h(A_i)| = 1 holds for each i = 1, ..., l. Clearly, we can ensure l ≥ a_1 |X|/2c. Now assume that we sample t elements of X independently and uniformly at random without replacement (clearly, sampling with replacement can only improve our bounds). It is now sufficient to show that if t ≥ c_1 m^{1-1/c}, then there is a constant probability of having bucket size at least c (the log(1/ε) factor then follows from the fact that we can repeat the sampling several times). To this end, first notice that there is a constant probability that at least t' = a_1 t/3 sampled elements belong to ∪A_i, so assume we sample t' elements from this set. For any A_i, if at least c sampled elements belong to A_i, there is a constant probability that they hit all elements of A_i. To complete the proof, notice that sampling Θ(l^{1-1/c}) elements from a set of size l with a constant probability results in some element being sampled at least c times (this is a generalized version of the well-known birthday problem for c ≥ 2). ❑
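The generalized birthday bound invoked at the end of the proof is easy to check empirically. The sketch below is our illustration (the constants coeff and trials are arbitrary choices, not from the paper); it estimates the probability that sampling on the order of l^{1-1/c} elements from a set of size l draws some element at least c times:

```python
import random
from collections import Counter

def collision_prob(l, c, coeff=2.0, trials=2000):
    """Estimate Pr(some element is drawn >= c times) when sampling
    t = coeff * l**(1 - 1/c) elements uniformly with replacement from
    a set of size l -- the generalized birthday problem of Lemma 10."""
    t = int(coeff * l ** (1.0 - 1.0 / c))
    hits = 0
    for _ in range(trials):
        counts = Counter(random.randrange(l) for _ in range(t))
        if max(counts.values()) >= c:
            hits += 1
    return hits / trials
```

For c = 2 this is the classical birthday problem: with l = 1000 and t ≈ 63 samples, a repeat occurs with constant (well above 1/2) probability, matching the Θ(l^{1-1/c}) sample bound used in the proof.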

For any slope s we define M_{s,R} = {L_(s, iR) | i ∈ Z}. We define the set S to be a random subset of the set

X_R = D ∩ (∪_{s ∈ B} M_{s,R}).

In other words, X_R consists of subintervals of equally spaced horizontal/vertical/diagonal lines, such that consecutive lines of the same slope are within distance O(R) of each other. We show that for any multifolding h the set X_R satisfies the condition on the set X of Lemma 10. To this end, we need the following lemma.
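As an illustration, the set X_R for d = 2 can be generated as follows. This is a hypothetical reconstruction: the spacing iR between consecutive lines of slope s is our reading of the definition of M_{s,R}, consistent with the O(R) spacing stated above:

```python
def build_XR(U, R):
    """Sketch of the adversary point set X_R for d = 2: grid points of
    [0, U)^2 lying on some line s.q = i*R (i an integer) with slope
    s in B = {-1, 0, 1}^2 - {(0, 0)}."""
    B = [(sx, sy) for sx in (-1, 0, 1) for sy in (-1, 0, 1)
         if (sx, sy) != (0, 0)]
    pts = set()
    for x in range(U):
        for y in range(U):
            # keep (x, y) if it lies on a line of some slope in B
            if any((sx * x + sy * y) % R == 0 for sx, sy in B):
                pts.add((x, y))
    return pts
```

The resulting set is a union of horizontal, vertical, and diagonal lines spaced R apart, so it has Θ(U^2/R) points, i.e., Θ(R) points when U = kR for constant k.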

Lemma 11 There exist lines T_s for s ∈ B (called thresholds) and rectangle-shaped regions A_s ⊂ D for s ∈ B of side bU, for some constant b > 0, such that:

• (0, 0) ∈ T_s^- for all s ∈ B,

• A_s ⊂ T_s^- for all s ∈ B,

• A_s ⊂ T_{s'}^- for all s, s' ∈ B such that s ≠ s',

• f_{T_s}(A_s) ⊂ T_{-s}^- for any s ∈ B (i.e., A_s, after folding along T_s, cannot exceed T_{-s}),

• for any s ∈ B, the distance between T_s and T_{-s} is at least a constant fraction of U.

Proof: Refer to Figure 1. ❑




Figure 1: D is depicted as a square; the thresholds are dotted.

Without loss of generality, we may assume that h is composed exclusively of foldings. Let h = h_l ∘ ... ∘ h_1. Also let L_i be the line defining folding h_i and s_i be the slope of L_i. We may assume that (0, 0) ∈ L_i^- and that for any i, j = 1, ..., l with i < j such that s_i = s_j we have L_i^- ⊂ L_j^- (in other words, lines of the same slope are monotonic). We say that a folding h_i is large if the distance between L_i and the last line L_j of the same slope as L_i is greater than aU, for some constant b > a > 0 to be defined later. Otherwise, we call the folding small.

Lemma 12 For any t = 1, ..., l, if no folding h_i from the set h_1, ..., h_t "exceeds" its threshold T_{s_i} (i.e., it is not the case that L_i^- ⊂ T_{s_i}^-), then (h_t ∘ ... ∘ h_1)(A_s) = A_s for any s ∈ B (i.e., the sets A_s remain unchanged by h_1, ..., h_t).

The lower bound strategy may now be described as follows. We proceed in phases. In the first phase we apply the consecutive foldings h_1, h_2, .... We stop the phase at h_t if one of the following happens:

Case 1: h_t is large; or,

Case 2: h_t exceeds its threshold and no h_i for i ≤ t is large.

Clearly, one of these two cases has to happen, as some h_i has to exceed its threshold (recall that D is mapped to a square of side R). We proceed as follows. Let s = s_t.

Case 1: In this case we know that both the width and the length of the rectangle-shaped region A = A_s ∩ L_t^+ are more than aU/2. This region is mapped by h_t to a rectangular region A' ⊂ L_t^- of the same size. As a result of this mapping, the intervals from M ∩ A belonging to lines of slope s are mapped to the corresponding intervals in M ∩ A'. We can now restrict our attention to some square region D' ⊂ A', i.e., apply the above adversary strategy with D := D', starting the next phase of the adversary strategy. Clearly, if this recursion step is (at some moment t') applied more than 4c times, then for some slope and for any point p from intervals of this slope belonging to the final square D'' we have |(h_{t'} ∘ ... ∘ h_1)^{-1}(p)| ≥ c. The constants k and b can easily be chosen such that the side of D'' is still ≥ 2R. Hence the number of points p as above is at least R.

Case 2: Let h_t be the folding which exceeds T_s, and for simplicity assume we are in the first phase. As all foldings (including the ones along the slope s) were small, all intervals from A_s (of length at least bU) were mapped to corresponding intervals of length at most aU. By the pigeonhole principle, at least (1 - ac/b)U points of those intervals must be mapped to locations to which more than c elements are mapped. By choosing b large enough with respect to a, we can ensure that at least R points are mapped to buckets of size at least c.

The above analysis shows that for any h ∈ H the fraction of points from X_R mapped to buckets of size at least c is constant. We can apply Lemma 10 with ε = 1/(2|H|) to show that a random subset S of X_R of cardinality c_1 R^{1-1/c} log(2|H|) contains such a point with probability at least 1 - 1/(2|H|). Hence the probability that such a point from S exists for every h ∈ H is at least 1/2, which proves that a set S with this property exists. ❑

Remark 3 The assumption that all slopes of foldings belong to B is not crucial. A slight modification of the above proof shows that the theorem holds for any set of slopes, provided the absolute difference of angles between any two slopes is constant. Also, if c = 2, the theorem holds without this assumption. Finally, a set of size O(R) can be constructed for any constant number of slopes, for any value of c.

Definition 2 A t-multifolding is any function h : Z^d → Z^d such that there exists a real multifolding f such that for any p, q ∈ Z^d, d(h(p), h(q)) ≤ d(f(p), f(q)) + t.

Remark 4 Theorem 8 holds when "multifoldings" are replaced by "t-multifoldings" (the constant c_1 then depends on both c and t). The proof needs only minor modifications: lines are made "thicker."

We conclude this section with the following conjecture.

Conjecture 1 Every non-expansive hash function h : D → I is a multifolding.

6 Summary and Further Directions

In this paper we have given a locality-preserving hashing scheme for sets of size Θ(R^{d/2}), while showing a bound of O(R^{1-1/c}) for non-expansive hashing schemes based on multifoldings. These bounds complement each other (modulo our conjecture that multifoldings are comprehensive). We also give a non-expansive hash family with collision probability 1/R (for d_∞), while showing a matching lower bound on the collision probability for non-expansive hashing in d_∞.

We now mention some extensions and further directions. If the dilation is allowed to be multiplicative (not additive), then one can modify the family described above by simulating each bucket storing E elements by E buckets, each storing at most one element. In this way we can obtain a perfect family of hash functions with constant multiplicative and small additive expansion terms. Can one get small bucket size with O(R) elements (rather than O(R^{1/2}))? It is quite easy (by adapting the hashing scheme of [21]) to store O(R) elements in O(R) buckets such that each element can be stored in at most O(log log R + log d) buckets and each bucket has O(1)^d · O(log log R) size. Can this be achieved with a constant depending only on d? Finally, an important open problem is to improve the upper bound on the bucket size, especially the O(d^d) bound for d_1.

The other questions concern lower bounds. Can we prove the folding theorem for arbitrary slopes for c > 2? Can we prove that any non-expansive (in the Manhattan or Euclidean metric) function can be represented by simple foldings (or general foldings)? Can we show that exponential bucket size is inevitable for hash functions with small additive dilation?

Acknowledgments

The first author would like to thank Moses Charikar and Suresh Venkatasubramanian for helpful discussions.

References

[1] P.K. Agarwal and J. Matousek, "Ray shooting and parametric search," Proc. 24th STOC (1992), pp. 517-526.

[2] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A. Wu, "An optimal algorithm for approximate nearest neighbor searching," Proc. 5th SODA (1994), pp. 573-582.

[3] C. Buckley, A. Singhal, M. Mitra, and G. Salton, "New retrieval approaches using SMART: TREC 4," Proc. Fourth Text REtrieval Conference, National Institute of Standards and Technology, 1995.

[4] K. Clarkson, "A randomized algorithm for closest-point queries," SIAM J. Computing, 17 (1988), pp. 830-847.

[5] S. Cost and S. Salzberg, "A weighted nearest neighbor algorithm for learning with symbolic features," Machine Learning, 10 (1993), pp. 57-67.

[6] T.M. Cover and P.E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, 13 (1967), pp. 21-27.

[7] S. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman, "Indexing by latent semantic analysis," Journal of the Society for Information Science, 41 (1990), pp. 391-407.

[8] L. Devroye and T.J. Wagner, "Nearest neighbor methods in discrimination," in Handbook of Statistics, vol. 2, P.R. Krishnaiah and L.N. Kanal, eds., North-Holland, 1982.

[9] D. Dobkin and R. Lipton, "Multidimensional search problems," SIAM J. Computing, 5 (1976), pp. 181-186.

[10] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.

[11] C. Faloutsos, R. Barber, M. Flickner, W. Niblack, D. Petkovic, and W. Equitz, "Efficient and effective querying by image content," Journal of Intelligent Information Systems, 3 (1994), pp. 231-262.

[12] A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic, 1991.

[13] T. Hastie and R. Tibshirani, "Discriminant adaptive nearest neighbor classification," Proc. First International Conference on Knowledge Discovery and Data Mining, 1995.

[14] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, 24 (1933), pp. 417-441.

[15] IEEE Computer, Special Issue on Content-Based Image Retrieval Systems, 28 (1995).

[16] K. Karhunen, "Über lineare Methoden in der Wahrscheinlichkeitsrechnung," Ann. Acad. Sci. Fennicae, Ser. A137, 1947.

[17] J. Kleinberg, "Two algorithms for nearest-neighbor search in high dimensions," these proceedings.

[18] D. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, 1973.

[19] E. Kreyszig, Introductory Functional Analysis with Applications, Wiley, 1989.

[20] V. Koivunen and S. Kassam, "Nearest neighbor filters for multivariate data," IEEE Workshop on Nonlinear Signal and Image Processing, 1995.

[21] N. Linial and O. Sasson, "Non-expansive hashing," Proc. 28th STOC (1996), pp. 509-517.

[22] M. Loève, Fonctions aléatoires de second ordre. Processus stochastiques et mouvement brownien, Hermann, Paris, 1948.

[23] J. Matousek, "Reporting points in halfspaces," Proc. 32nd FOCS, 1991.

[24] S. Meiser, "Point location in arrangements of hyperplanes," Information and Computation, 106:2 (1993), pp. 286-303.

[25] K. Mulmuley, "Randomized multi-dimensional search trees: further results in dynamic sampling," Proc. 32nd FOCS, 1991.

[26] Panel on Discriminant Analysis and Clustering, National Research Council, Discriminant Analysis and Clustering, National Academy Press, 1988.

[27] A. Pentland, R.W. Picard, and S. Sclaroff, "Photobook: tools for content-based manipulation of image databases," Proc. SPIE Conference on Storage and Retrieval of Image and Video Databases II, 2185, 1994.

[28] G. Salton, Automatic Text Processing, Addison-Wesley, Reading, MA, 1989.

[29] H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, Reading, MA, 1989.

[30] A.W.M. Smeulders and R. Jain, editors, Image Databases and Multi-Media Search, Proceedings of the First International Workshop IDB-MMS '96, Amsterdam University Press, Amsterdam, 1996.

[31] A.C. Yao and F.F. Yao, "A general approach to d-dimensional geometric queries," Proc. 17th STOC (1985), pp. 163-168.
