
Nearest Neighbor Retrieval Using Distance-Based Hashing


Michalis Potamias and Panagiotis Papapetrou

supervised by Prof. George Kollios

A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.

Hierarchical DBH:
1. Rank queries according to D(Q,N(Q))
2. Divide space into disjoint subsets (equi-height)
3. Train separate indices for each subset

Reduce Hash Cost: Use a small number of “pseudoline” points

Database Group

Problem

NEAREST NEIGHBOR: Given a database S and a distance function D, our task is: for a previously unseen query q, locate a point p of the database such that the distance between q and every point o of the database is greater than or equal to the distance between p and q.

COST MODEL: Minimize the number of distance computations. Computing D may be very expensive:

Dynamic Time Warping for time series

Edit distance variants for DNA alignment

PROBLEM DEFINITION: Define index structure to answer Nearest Neighbor queries efficiently

A SOLUTION: Brute Force! Try them all and get the exact answer
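The brute-force baseline can be sketched as follows. This is a minimal illustration under the poster's cost model, in which every call to D counts as one unit of cost; the function name `brute_force_nn` and its return shape are assumptions made for this example, not part of the original work.

```python
def brute_force_nn(query, database, dist):
    """Exact nearest neighbor by scanning the whole database.

    Returns (nearest object, its distance, number of distance computations).
    `dist` is treated as a black box, as in the cost model above.
    """
    if not database:
        raise ValueError("database must not be empty")
    best, best_d = None, float("inf")
    computations = 0
    for obj in database:
        d = dist(query, obj)  # one (possibly expensive) distance computation
        computations += 1
        if d < best_d:
            best, best_d = obj, d
    return best, best_d, computations


if __name__ == "__main__":
    # Toy 1-D example with absolute difference as the distance.
    print(brute_force_nn(5, [1, 4, 9], lambda a, b: abs(a - b)))
```

The number of distance computations always equals the database size, which is the cost an index structure tries to undercut.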

OUR SOLUTION: Are we willing to trade accuracy for efficiency?

[Figure: training phase. Labels: Distance Matrix, Desired Accuracy, TRAINING PHASE, DBH Index Structure, NN with statistical arguments, h.]

Hash Based Indexing

Idea:

1. Come up with hash functions that hash similar objects to similar buckets

2. Hash every database object to some buckets

3. At query time apply the same hash function to the query

4. Filter: Retrieve the collisions. The rest of the database is pruned.

5. Refine: Compute actual distances. Return the object with the smallest distance as the NN.
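The five steps above can be sketched as a single-table filter-and-refine index. This is only an illustrative sketch: the class name `HashIndex`, the toy hash function, and the toy distance are assumptions for the example and do not come from the poster.

```python
from collections import defaultdict


class HashIndex:
    """Single hash table with filter-and-refine nearest neighbor lookup."""

    def __init__(self, hash_fn, dist):
        self.hash_fn = hash_fn            # step 1: a hash function
        self.dist = dist                  # black-box distance D
        self.buckets = defaultdict(list)

    def add(self, obj):
        # Step 2: hash every database object to a bucket.
        self.buckets[self.hash_fn(obj)].append(obj)

    def query(self, q):
        # Steps 3 and 4: hash the query and keep only the collisions.
        candidates = self.buckets.get(self.hash_fn(q), [])
        if not candidates:
            return None                   # nothing collided with the query
        # Step 5: compute actual distances only for the candidates.
        return min(candidates, key=lambda o: self.dist(q, o))


if __name__ == "__main__":
    index = HashIndex(hash_fn=lambda x: x // 10, dist=lambda a, b: abs(a - b))
    for value in [3, 8, 14, 27]:
        index.add(value)
    print(index.query(6))   # candidates are 3 and 8
    print(index.query(10))  # only 14 collides, although 8 is closer
```

The second query shows why the result is approximate: the true nearest neighbor is pruned when it lands in a different bucket.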

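[Figure: the query is hashed with h; distances D are computed to the colliding objects and the minimum (min) is returned.]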

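Locality Sensitive Hashing

Locality Sensitive Family of Functions: a family H is (r_1, r_2, p_1, p_2)-sensitive, with r_1 < r_2 and p_1 > p_2, if

\[
D(x_1, x_2) < r_1 \;\Rightarrow\; \Pr_{h \in H}\big(h(x_1) = h(x_2)\big) \ge p_1
\]

\[
D(x_1, x_2) > r_2 \;\Rightarrow\; \Pr_{h \in H}\big(h(x_1) = h(x_2)\big) \le p_2
\]

Amplify the gap between p_1 and p_2: Randomly pick l hash vectors of k functions each. Probability of collision in at least one of the l hash tables:

\[
\mathrm{dist} < r_1 \;\Rightarrow\; \Pr \ge 1 - \big(1 - p_1^{k}\big)^{l}
\]

\[
\mathrm{dist} > r_2 \;\Rightarrow\; \Pr \le 1 - \big(1 - p_2^{k}\big)^{l}
\]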
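The amplification step, the probability of colliding in at least one of l tables built from k functions each, can be checked numerically. The sketch below is only an illustration; the function name `collision_prob` and the example values of p1, p2, k, and l are assumptions chosen for the demonstration.

```python
def collision_prob(p, k, l):
    """Probability of a collision in at least one of l tables of k bits each,
    given a per-function collision probability p: 1 - (1 - p**k)**l."""
    return 1.0 - (1.0 - p ** k) ** l


if __name__ == "__main__":
    p1, p2 = 0.9, 0.5   # hypothetical per-function collision probabilities
    k, l = 10, 20       # hypothetical table parameters
    near = collision_prob(p1, k, l)
    far = collision_prob(p2, k, l)
    print(f"near pairs: {near:.4f}, far pairs: {far:.4f}, gap: {near - far:.4f}")
```

Raising k pushes both probabilities down, and raising l pushes both up; the pair (k, l) is chosen so that near pairs still collide often while far pairs rarely do.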

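H using Pseudoline Projections (H_DBH)

Works on arbitrary spaces, but is not locality sensitive!

Define a line projection function that maps an arbitrary space into the real line R.

Real valued:

\[
F^{x_1,x_2}(x) = \frac{D(x, x_1)^2 + D(x_1, x_2)^2 - D(x, x_2)^2}{2\, D(x_1, x_2)}
\]

Discrete valued:

\[
F^{x_1,x_2}_{t_1,t_2}(x) =
\begin{cases}
0 & \text{if } F^{x_1,x_2}(x) \in [t_1, t_2] \\
1 & \text{otherwise}
\end{cases}
\]

Hash tables should be balanced. Thus t_1, t_2 are chosen from V:

\[
V(x_1, x_2) = \Big\{ [t_1, t_2] \;\Big|\; \Pr_{x \in X}\big(F^{x_1,x_2}_{t_1,t_2}(x) = 0\big) = 0.5 \Big\}
\]

[Figure: an object x projected to F(x) on the line through x1 and x2, labeled with D(x,x2) and D(x1,x2); thresholds t1 and t2 marked on the real line R.]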
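The pseudoline projection and the balanced binary hash function can be sketched as below. This is a minimal illustration, assuming the balance condition is approximated on a finite sample by picking an interval that contains half of the sampled projections; the names `line_projection` and `make_binary_hash` are hypothetical.

```python
import random


def line_projection(x, x1, x2, dist):
    """Project x onto the 'line' through pivots x1 and x2, using distances only."""
    d12 = dist(x1, x2)
    if d12 == 0:
        raise ValueError("pivots x1 and x2 must be at nonzero distance")
    return (dist(x, x1) ** 2 + d12 ** 2 - dist(x, x2) ** 2) / (2.0 * d12)


def make_binary_hash(x1, x2, sample, dist, rng):
    """Build a binary hash: 0 if the projection falls in [t1, t2], else 1.

    [t1, t2] is chosen so that half of `sample` projects inside it
    (exactly half when the sampled projections are distinct).
    """
    if len(sample) < 2:
        raise ValueError("need at least two sample objects")
    proj = sorted(line_projection(x, x1, x2, dist) for x in sample)
    half = len(proj) // 2
    start = rng.randrange(0, len(proj) - half + 1)  # random window of `half` values
    t1, t2 = proj[start], proj[start + half - 1]

    def h(x):
        return 0 if t1 <= line_projection(x, x1, x2, dist) <= t2 else 1

    return h


if __name__ == "__main__":
    dist = lambda a, b: abs(a - b)   # toy distance; any black-box D works
    sample = list(range(20))
    h = make_binary_hash(0, 10, sample, dist, random.Random(0))
    print(sum(h(x) == 0 for x in sample), "of", len(sample), "sample objects hash to 0")
```

Each evaluation of `h` needs two distance computations, one to each pivot, which is where the hash cost in the analysis comes from.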

ACCURACY vs. EFFICIENCY: How often is the actual NN retrieved? How much time does NN retrieval take?

Analysis

Probability of collision between any two objects:

\[
C(x_1, x_2) = \Pr_{h \in H_{DBH}}\big(h(x_1) = h(x_2)\big)
\]

Same probability on a k-bit hash table:

\[
C_k(x_1, x_2) = C(x_1, x_2)^{k}
\]

Probability of collision in at least one of the l hash tables:

\[
C_{k,l}(x_1, x_2) = 1 - \big(1 - C(x_1, x_2)^{k}\big)^{l}
\]

Accuracy, i.e. the probability over all queries Q that we will retrieve the nearest neighbor N(Q):

\[
\mathrm{Accuracy}_{k,l} = \int_{Q \in X} C_{k,l}\big(Q, N(Q)\big)\, \Pr(Q)\, dQ
\]

LookupCost: Expected number of objects that collide in at least one of the l hash tables:

\[
\mathrm{LookupCost}_{k,l}(Q) = \sum_{x \in U} C_{k,l}(Q, x)
\]

HashCost: # of distance computations to evaluate h-functions:

\[
\mathrm{HashCost}_{k,l}(Q) = 2kl
\]

Total Cost per query:

\[
\mathrm{Cost}_{k,l}(Q) = \mathrm{LookupCost}_{k,l}(Q) + \mathrm{HashCost}_{k,l}(Q)
\]

Efficiency (for all Queries):

\[
\mathrm{Cost}_{k,l} = \int_{Q \in X} \mathrm{Cost}_{k,l}(Q)\, \Pr(Q)\, dQ
\]

Use Sampling to estimate Accuracy and Efficiency

1. Sample Queries

2. Sample Database Objects

3. Sample Hash Functions

4. Compute Integrals

Finding optimal k & l, given accuracy (say 90%): For k = 1, 2, …, compute the smallest l that yields the required accuracy.

Typically, optimal k: last k for which efficiency improves.
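The sampling-based estimates and the search for k and l can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the per-pair collision rates C(Q, N(Q)) and C(Q, x) have already been estimated from sampled hash functions, replaces the integrals with sample averages, and scales the lookup cost from the database sample to a hypothetical `db_size`. All function names and the `max_k`, `max_l` limits are assumptions.

```python
def c_kl(c, k, l):
    """Collision probability in at least one of l tables with k bits each."""
    return 1.0 - (1.0 - c ** k) ** l


def accuracy(c_nn, k, l):
    """Average of C_{k,l}(Q, N(Q)) over the sampled queries."""
    return sum(c_kl(c, k, l) for c in c_nn) / len(c_nn)


def cost(c_db, db_size, k, l):
    """Average per-query cost: scaled lookup cost plus hash cost 2*k*l.

    c_db[i] holds C(Q_i, x) for the sampled database objects x.
    """
    total = 0.0
    for row in c_db:
        lookup = db_size * sum(c_kl(c, k, l) for c in row) / len(row)
        total += lookup + 2 * k * l
    return total / len(c_db)


def choose_k_l(c_nn, c_db, db_size, target, max_k=30, max_l=1000):
    """For k = 1, 2, ... take the smallest l reaching `target` accuracy;
    stop at the last k for which the estimated cost still improves."""
    best = None
    for k in range(1, max_k + 1):
        l = next((l for l in range(1, max_l + 1)
                  if accuracy(c_nn, k, l) >= target), None)
        if l is None:
            break                      # target not reachable within max_l
        c = cost(c_db, db_size, k, l)
        if best is not None and c >= best[2]:
            break                      # efficiency stopped improving
        best = (k, l, c)
    return best


if __name__ == "__main__":
    # Hypothetical collision rates, for illustration only.
    c_nn = [0.9, 0.8, 0.95]
    c_db = [[0.5, 0.4, 0.6, 0.9], [0.5, 0.3, 0.8, 0.5], [0.45, 0.55, 0.5, 0.95]]
    print(choose_k_l(c_nn, c_db, db_size=1000, target=0.9))
```

The estimates are only as good as the samples: if the sampled queries or database objects are not representative, the predicted accuracy and cost can differ from actual performance.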

[Figure: histogram of Number of Queries (0 to 800) against C(Q,N(Q)) (0.5 to 1).]

Additional Optimizations

Experiments

[Figure: D(Q,N(Q)) axis with division points 0, d1, d2, d3.]

Conclusion

General purpose

Distance is black box

Does not require metric properties

Statistical analysis is possible

Even when NN is not returned, a very close N is returned… For many applications that’s fine!!

Not sublinear in size of DB

Statistical (not probabilistic)

Need “representative” sample sets

Hands dataset: actual performance was different from simulation, because the training set was not representative!
