
Nearest Neighbor Retrieval Using Distance-Based Hashing


Page 1: Nearest Neighbor Retrieval Using Distance-Based Hashing

Nearest Neighbor Retrieval Using Distance-Based Hashing

Vassilis Athitsos, University of Texas, Arlington
Michalis Potamias+, Boston University
Panagiotis Papapetrou, Boston University
George Kollios, Boston University

Page 2: Nearest Neighbor Retrieval Using Distance-Based Hashing

04/19/23 Distance Based Hashing 2

nearest neighbor problem

Setting:
  database of objects S
  distance function D

Given: query Q (previously unseen)

Find and return: the object P* from S that is closest to Q

Nearest neighbors (NNs) appear in various applications under many different distance functions:
  classification of handwritten digits
  hand-pose estimation

Can perform a linear scan, but the cost is high when:
  S is large
  D is expensive
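The linear scan baseline can be sketched as follows. This is an illustration only, not code from the presentation: the function name `linear_scan_nn` and the toy absolute-difference distance are assumptions made for the example. It makes the cost explicit: one call to D per database object.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def linear_scan_nn(query: T, database: Sequence[T],
                   dist: Callable[[T, T], float]) -> Tuple[T, float, int]:
    """Return (nearest object, its distance, number of distance computations)."""
    if not database:
        raise ValueError("database must not be empty")
    best_obj, best_dist = database[0], dist(query, database[0])
    for obj in database[1:]:
        d = dist(query, obj)
        if d < best_dist:
            best_obj, best_dist = obj, d
    # Exactly one call to dist per database object.
    return best_obj, best_dist, len(database)


# Toy usage: numbers as objects, absolute difference as D (assumption).
S = [10.0, 3.0, 7.5, 42.0]
P_star, d_star, n_calls = linear_scan_nn(8.0, S, lambda a, b: abs(a - b))
```

With |S| objects the scan always pays |S| distance computations, which is the quantity the rest of the talk tries to reduce.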

Page 3: Nearest Neighbor Retrieval Using Distance-Based Hashing

cost model

Dominating cost: the distance function may be very “expensive”
  Time series (DTW)
  String alignment (Edit)
  Computer vision

Cost model: minimize the number of distance computations

[Figure: dynamic programming for edit distance]
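To make the "expensive distance" point concrete, here is a minimal sketch of edit distance computed by dynamic programming. Unit costs for insertion, deletion and substitution are an assumption; the slides do not specify the cost scheme. Each call takes time proportional to the product of the two string lengths, which is why counting distance computations is a reasonable cost model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed one row at a time."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # delete ca
                curr[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),     # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


d = edit_distance("kitten", "sitting")
```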

Page 4: Nearest Neighbor Retrieval Using Distance-Based Hashing

some existing solutions

If objects are low dimensional, exact nearest neighbor search is fast

If objects are high dimensional, for some distance functions (e.g., Hamming) approximate nearest neighbor search is fast, using LSH

However, in many interesting settings (high dimensional, non-metric), “linear scan” may be the only approach for exact NNs

Page 5: Nearest Neighbor Retrieval Using Distance-Based Hashing

dbh setting

No assumptions about the distance function
  possibly non-metric

Distance function computations dominate the cost

Trade perfect accuracy for faster results

Page 6: Nearest Neighbor Retrieval Using Distance-Based Hashing

dbh method overview

Preprocess: hash the database using appropriate functions

Query Q arrives: hash it!
  Filter: retrieve colliding objects as “candidate NNs”
  Refine: compute the actual distance between the query and the candidates
  Return: the candidate that is closest to Q

Page 7: Nearest Neighbor Retrieval Using Distance-Based Hashing

Background

Page 8: Nearest Neighbor Retrieval Using Distance-Based Hashing

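Background: hash-based indexing

Building the index: hash the database with functions h1, h2, …, hL, using L tables in parallel

Query time: hash the query, compute D for the colliding objects, and take the min

[Figure: diagram of index construction and query processing]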

Page 9: Nearest Neighbor Retrieval Using Distance-Based Hashing

Background: locality sensitive hashing

Choice of hash functions is important!
  LSH family of functions (LSHF) [IM98]

An LSHF in a hash-based indexing scheme guarantees sublinear behavior for approximate NNs!

Such families have been constructed for Hamming, L2, …

What if there is no LSH family for the distance function used (Edit, DTW, etc.)?

[Figure: illustration with labels x, r, cr, y, z]
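As one concrete case of an LSH family, the classic construction for Hamming distance over bit vectors samples coordinates: each hash function reads K randomly chosen bit positions. The sketch below is illustrative and not taken from the slides; the names `make_bit_sampling_hash` and `hamming`, and all parameter values, are assumptions. It checks empirically that a nearby vector collides with x more often than a distant one.

```python
import random
from typing import Callable, Sequence, Tuple


def make_bit_sampling_hash(dim: int, k: int,
                           rng: random.Random) -> Callable[[Sequence[int]], Tuple[int, ...]]:
    """Build one K-bit hash function that reads k random coordinates."""
    positions = [rng.randrange(dim) for _ in range(k)]

    def h(x: Sequence[int]) -> Tuple[int, ...]:
        return tuple(x[p] for p in positions)

    return h


def hamming(x: Sequence[int], y: Sequence[int]) -> int:
    return sum(a != b for a, b in zip(x, y))


rng = random.Random(0)
dim, k, trials = 32, 4, 2000
x = [rng.randint(0, 1) for _ in range(dim)]
near = list(x)
near[0] ^= 1                                              # Hamming distance 1 from x
far = [b ^ 1 if i < 16 else b for i, b in enumerate(x)]   # Hamming distance 16 from x

# Empirical collision rates over many independently drawn hash functions.
hashes = [make_bit_sampling_hash(dim, k, rng) for _ in range(trials)]
near_rate = sum(h(x) == h(near) for h in hashes) / trials
far_rate = sum(h(x) == h(far) for h in hashes) / trials
```

No comparable family is known for distances such as edit distance or DTW, which is the gap the rest of the talk addresses.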

Page 10: Nearest Neighbor Retrieval Using Distance-Based Hashing

Distance Based Hashing

Hash-based indexing scheme
Can be applied to any space and any D
Its hash functions treat D as a black box
Optimization

Page 11: Nearest Neighbor Retrieval Using Distance-Based Hashing

DBH: family of hash functions

Pseudo-line projection [FL95]
  maps an object onto the real line
  y, z are pivot points from the database
  project x on the y-z pseudoline
  use a threshold to make it discrete valued

$F^{y,z}(x) = \frac{D(x,y)^2 + D(y,z)^2 - D(x,z)^2}{2\,D(y,z)}$

(-) This family is not an LSHF
(+) The definition does not depend on the specific distance function, only on the 3 pairwise distances

[Figure: x projected onto the pseudoline through pivots y and z, with D(y,z) and F^{y,z}(x) marked]
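The pseudoline projection uses only the three pairwise distances D(x,y), D(y,z) and D(x,z): it is the squared distance from x to y, plus the squared pivot distance, minus the squared distance from x to z, all divided by twice the pivot distance. A minimal sketch follows. The function names are hypothetical, and the single-threshold discretization in `binary_hash` is an assumption, since the slide says only that a threshold is used. Euclidean distance serves here as a sanity check, because in Euclidean space the formula gives the ordinary projection onto the line through y and z.

```python
import math
from typing import Callable, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def pseudoline_projection(x: T, y: T, z: T, dist: Distance) -> float:
    """F^{y,z}(x): position of x along the y-z pseudoline, measured from y."""
    d_yz = dist(y, z)
    if d_yz == 0:
        raise ValueError("pivots y and z must be at nonzero distance")
    return (dist(x, y) ** 2 + d_yz ** 2 - dist(x, z) ** 2) / (2 * d_yz)


def binary_hash(x: T, y: T, z: T, threshold: float, dist: Distance) -> int:
    """Discretize the projection with one threshold (assumed form)."""
    return 1 if pseudoline_projection(x, y, z, dist) >= threshold else 0


def euclidean(p, q):
    return math.dist(p, q)


y, z = (0.0, 0.0), (4.0, 0.0)
f_on_line = pseudoline_projection((1.0, 0.0), y, z, euclidean)
f_off_line = pseudoline_projection((1.0, 3.0), y, z, euclidean)
```

Both points have x-coordinate 1, so both project to the same position on the line through y and z; `dist` is treated purely as a black box.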

Page 12: Nearest Neighbor Retrieval Using Distance-Based Hashing

DBH: method

Preprocessing:
1. Use a random choice of K of these pseudoline projections to define a hash function
2. Build L such (K-bit) functions
3. Hash all objects of S to the L hash tables

At query time:
1. Apply the same L functions to Q
2. Filter: retrieve colliding objects (candidate set)
3. Refine: invoke D for the candidates
4. Return: nearest*
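The preprocessing and query steps can be sketched as a small class. This is a hedged illustration, not the authors' implementation: the class name `DBHIndex`, the choice of the median projection of a small sample as the threshold, and the sample size of 50 are all assumptions.

```python
import math
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set, Tuple, TypeVar

T = TypeVar("T")


class DBHIndex:
    """L hash tables whose K-bit keys come from thresholded pseudoline projections."""

    def __init__(self, database: Sequence[T], dist: Callable[[T, T], float],
                 k: int, l: int, seed: int = 0) -> None:
        if len(database) < 2:
            raise ValueError("need at least two objects to pick pivots")
        self.db, self.dist = list(database), dist
        rng = random.Random(seed)
        sample = rng.sample(self.db, min(len(self.db), 50))
        # One list of K binary functions per table: (y, z, D(y,z), threshold).
        self.funcs = [[self._random_function(rng, sample) for _ in range(k)]
                      for _ in range(l)]
        self.tables: List[Dict[Tuple[int, ...], List[int]]] = []
        for table_funcs in self.funcs:
            table = defaultdict(list)
            for idx, obj in enumerate(self.db):
                table[self._key(obj, table_funcs)].append(idx)
            self.tables.append(table)

    def _random_function(self, rng, sample):
        d_yz = 0.0
        while d_yz == 0.0:                 # assumes at least two distinct objects
            y, z = rng.sample(self.db, 2)
            d_yz = self.dist(y, z)
        # Assumption: threshold = median projection of the sample,
        # so each bit splits the data roughly in half.
        proj = sorted(self._project(x, y, z, d_yz) for x in sample)
        return y, z, d_yz, proj[len(proj) // 2]

    def _project(self, x, y, z, d_yz):
        return (self.dist(x, y) ** 2 + d_yz ** 2 - self.dist(x, z) ** 2) / (2 * d_yz)

    def _key(self, x, table_funcs):
        return tuple(int(self._project(x, y, z, d_yz) >= t)
                     for y, z, d_yz, t in table_funcs)

    def query(self, q: T) -> Tuple[T, int]:
        """Return (closest candidate, number of refine-step distance computations)."""
        candidates: Set[int] = set()
        for table_funcs, table in zip(self.funcs, self.tables):          # filter
            candidates.update(table.get(self._key(q, table_funcs), []))
        if not candidates:
            raise LookupError("no collisions, so there are no candidates")
        best = min(candidates, key=lambda i: self.dist(q, self.db[i]))   # refine
        return self.db[best], len(candidates)


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(500)]
index = DBHIndex(points, math.dist, k=4, l=5)
answer, n_refine = index.query(points[17])
```

A query that is itself a database object always collides with its own entry, so it is returned exactly; for unseen queries the result may be only an approximate nearest neighbor.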

Page 13: Nearest Neighbor Retrieval Using Distance-Based Hashing

DBH: accuracy vs cost

Accuracy: percentage of queries for which DBH returns the true NN
Cost: number of distance computations
Problem: given a desired accuracy, minimize the cost
The choice of K, L affects both the cost and the accuracy
Sampling: approximate the distributions
  probability of NNs colliding
  probability of non-NNs colliding
Perform binary search for the best (K, L)

[Figure: training phase diagram with labels: distance matrix, desired accuracy, K, L, DBH index structure]

Page 14: Nearest Neighbor Retrieval Using Distance-Based Hashing

DBH: accuracy

Probability of collision between any query Q and its nearest neighbor N(Q) for a single projection function:

$C(Q, N(Q)) = \Pr_{h \in H_{DBH}}\big(h(Q) = h(N(Q))\big)$

Employ sampling to estimate C(Q, N(Q))

Use K and L to shift the distribution to the desired accuracy. Probability of collision in at least one of the L K-bit tables:

$C_{K,L}(Q, N(Q)) = 1 - \big(1 - C(Q, N(Q))^K\big)^L$

…and compute

$\mathrm{Accuracy}_{K,L} = \int_{Q \in X} C_{K,L}(Q, N(Q)) \Pr(Q)\, dQ$

[Figure: plot with x-axis ticks from 0.5 to 1 and y-axis ticks from 0 to 800]
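The accuracy estimate can be sketched numerically. If a single projection function makes Q and N(Q) collide with probability C, then all K bits of one table agree with probability C^K, and at least one of L tables collides with probability 1 - (1 - C^K)^L. The sketch below replaces the integral over queries by an average over sampled queries, which is an assumption about how the sampling is done. The values in `sample_c` are hypothetical, not data from the slides.

```python
from typing import Sequence


def collision_prob_kl(c: float, k: int, l: int) -> float:
    """C_{K,L} = 1 - (1 - C^K)^L for a single-function collision rate C."""
    return 1.0 - (1.0 - c ** k) ** l


def estimated_accuracy(sample_c: Sequence[float], k: int, l: int) -> float:
    """Average C_{K,L}(Q, N(Q)) over sampled queries."""
    return sum(collision_prob_kl(c, k, l) for c in sample_c) / len(sample_c)


# Hypothetical per-query estimates of C(Q, N(Q)).
sample_c = [0.95, 0.9, 0.85, 0.7]
acc_1_table = estimated_accuracy(sample_c, k=10, l=1)
acc_20_tables = estimated_accuracy(sample_c, k=10, l=20)
```

Increasing K lowers the collision probability and increasing L raises it, which is the lever used to reach a target accuracy.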

Page 15: Nearest Neighbor Retrieval Using Distance-Based Hashing

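DBH: cost

Hash and LookUp

HashCost: number of distance computations needed to evaluate the hash functions

$\mathrm{HashCost}_{K,L}(Q) = 2KL$

LookupCost: number of objects that collide in at least one of the L hash tables

$\mathrm{LookupCost}_{K,L}(Q) = \sum_{x \in U} C_{K,L}(Q, x)$

Query cost:

$\mathrm{Cost}_{K,L}(Q) = \mathrm{LookupCost}_{K,L}(Q) + \mathrm{HashCost}_{K,L}(Q)$

Total cost (for all queries):

$\mathrm{Cost}_{K,L} = \int_{Q \in X} \mathrm{Cost}_{K,L}(Q) \Pr(Q)\, dQ$

[Figure: pseudoline projection of x onto pivots y and z, with D(x,y), D(x,z), D(y,z) and F^{y,z}(x) marked]
[Figure: hash-based indexing diagram (D, D, D, min)]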
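Putting the accuracy and cost models together, the choice of (K, L) can be sketched as follows. The organization of the search is an assumption: the slides say only that a binary search is performed. Here, for each K, a binary search finds the smallest L that reaches the target accuracy (accuracy never decreases as L grows), and the cheapest resulting pair is kept. The query cost is the expected number of colliding database objects plus 2KL distance computations for hashing. All input values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def c_kl(c: float, k: int, l: int) -> float:
    """Probability of colliding in at least one of L K-bit tables."""
    return 1.0 - (1.0 - c ** k) ** l


def accuracy(nn_c: Sequence[float], k: int, l: int) -> float:
    """Sample average of C_{K,L}(Q, N(Q))."""
    return sum(c_kl(c, k, l) for c in nn_c) / len(nn_c)


def query_cost(db_c: Sequence[float], k: int, l: int) -> float:
    """LookupCost + HashCost for one query, given C(Q, x) for each database x."""
    return sum(c_kl(c, k, l) for c in db_c) + 2 * k * l


def choose_k_l(nn_c: Sequence[float], db_c_per_query: Sequence[Sequence[float]],
               target: float, max_k: int = 30,
               max_l: int = 1000) -> Optional[Tuple[int, int, float]]:
    best = None
    for k in range(1, max_k + 1):
        if accuracy(nn_c, k, max_l) < target:
            continue                       # even max_l tables are not enough
        lo, hi = 1, max_l                  # find the smallest l meeting the target
        while lo < hi:
            mid = (lo + hi) // 2
            if accuracy(nn_c, k, mid) >= target:
                hi = mid
            else:
                lo = mid + 1
        cost = sum(query_cost(row, k, lo) for row in db_c_per_query) / len(db_c_per_query)
        if best is None or cost < best[2]:
            best = (k, lo, cost)
    return best


# Hypothetical estimates: three sampled queries, 1000 database objects each.
nn_c = [0.9, 0.95, 0.85]
db_c_per_query = [[0.5] * 1000 for _ in nn_c]
result = choose_k_l(nn_c, db_c_per_query, target=0.9)
```

In this toy setting the selected pair costs far fewer distance computations than the 1000 a linear scan would need, but the numbers carry no meaning beyond the example.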

Page 16: Nearest Neighbor Retrieval Using Distance-Based Hashing

DBH: further optimization

1. Hierarchical DBH
  Build M parallel DBH indices for different subsets of queries
  Partition according to the distribution of D(Q, N(Q))
  Queries that are close to their NN are “easier”

2. Reduce HashCost by restricting H_DBH to a small subset of database pivot points for the projections

Page 17: Nearest Neighbor Retrieval Using Distance-Based Hashing

Experiments

Page 18: Nearest Neighbor Retrieval Using Distance-Based Hashing

experiments: datasets

We test DBH on 3 datasets:

Dataset   Objects                          Distance                  Size
Unipen    time series (~30), digits        Dynamic Time Warping      10K (test: 5K)
MNIST     images 28x28, digits             Shape Context Matching    60K (test: 10K)
Hands     images 256x256, hand pose        Chamfer Distance          80K (test: 1K)

Page 19: Nearest Neighbor Retrieval Using Distance-Based Hashing

experiments: results

Training set used to optimize K, L

Test-set experiment

Compare to a modified VP-tree that handles non-metric data

Accuracy vs. cost plot
  X-axis: accuracy
  Y-axis: distance computations

Page 20: Nearest Neighbor Retrieval Using Distance-Based Hashing


experiments: results

Page 21: Nearest Neighbor Retrieval Using Distance-Based Hashing

conclusion

Distance Based Hashing is a hash-based indexing framework for NN retrieval
  Not sublinear, just a speedup
  General purpose: no properties assumed for the distance function (black box)
  May be further optimized for bigger speedups

Future: Can we build a scheme for a “black box” distance function and provide a statistical argument for sublinear behavior in the size of the database?

Page 22: Nearest Neighbor Retrieval Using Distance-Based Hashing


thank you!

Famous NNs : Castor (Κάστωρ) and Polydeuces (Πολυδεύκης)