Near Neighbor Problem Made Fair
Sariel Har-Peled (UIUC)
Sepideh Mahabadi (TTIC)
Nearest Neighbor Problems
• Nearest Neighbor: given a set of objects, find the closest one to the query object.
• Near Neighbor: given a set of objects, find one that is close enough to the query object.
There are many applications of NN: searching for the closest object.
Near Neighbor
• Dataset of n points P in a metric space, e.g. ℝ^d, and a parameter r
• A query point q comes online
Goal:
• Find a point p* in the r-neighborhood of q
• Do it in sub-linear time and small space
All existing algorithms for this problem:
• Either have space or query time depending exponentially on d
• Or assume certain properties about the data, e.g., bounded intrinsic dimension
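To ground the sub-linear goal, here is a hypothetical brute-force baseline (not from the slides): a linear scan answers a near-neighbor query in O(n) time with no preprocessing, which is exactly the cost the data structures above try to beat. The function name and the sample data are illustrative only.

```python
import math

def near_neighbor_linear(P, q, r):
    """Baseline near-neighbor query: scan all n points, O(n) per query."""
    for p in P:
        if math.dist(p, q) <= r:  # any point within distance r is a valid answer
            return p
    return None  # no point in the r-neighborhood of q

P = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
print(near_neighbor_linear(P, (3.5, 4.0), r=1.0))  # -> (3.0, 4.0)
```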
Approximate Near Neighbor
• Dataset of n points P in a metric space, e.g. ℝ^d, and a parameter r
• A query point q comes online
Goal:
• Find a point p* in the r-neighborhood
• Do it in sub-linear time and small space
• Approximate Near Neighbor: report a point within distance cr, for c > 1
• For Hamming (and Manhattan) the query time is n^{O(1/c)} [IM98], and for Euclidean it is n^{O(1/c²)} [AI08]
Fair Near Neighbor
• Sample a neighbor of the query uniformly at random
• Individual fairness: every neighbor has the same chance of being reported, removing the bias inherent in the NN data structure
Applications:
• Removing noise, k-NN classification
• Anonymizing the data
• Counting the neighborhood size
Fair Near Neighbor
• Dataset of n points P in a metric space, e.g. ℝ^d, and a parameter r
• A query point q comes online
Goal:
• Return each point p in the neighborhood of q with uniform probability
• Do it in sub-linear time and small space

Approximate Fair Near Neighbor
Goal of Approximate Fair NN:
• Any point p in B(q, r) is reported with "almost uniform" probability μ(p), where
  1 / ((1 + ε) · |B(q, r)|) ≤ μ(p) ≤ (1 + ε) / |B(q, r)|
Results on (1 + ε)-Approximate Fair NN
S_ANN and Q_ANN are the space and query time of standard ANN.

Domain                                            Guarantee        Space       Query
Exact neighborhood B(q, r)                        w.h.p.           O(S_ANN)    Õ(Q_ANN · |B(q, cr)| / |B(q, r)|)
Approximate neighborhood B(q, r) ⊆ S ⊆ B(q, cr)   In expectation   O(S_ANN)    Õ(Q_ANN)

• Approximate neighborhood: a set S such that B(q, r) ⊆ S ⊆ B(q, cr)
• The dependence on ε is O(log(1/ε))
• Experiments
• A recent paper [Aumüller, Pagh, Silvestri '19] defines the same notion
Locality Sensitive Hashing (LSH)
One of the main approaches to solving the Nearest Neighbor problems.
• Hashing scheme s.t. close points have a higher probability of collision than far points
• Hash functions: g_1, …, g_L
• Each g_i is an independently chosen hash function
• If ‖p − p′‖ ≤ r, they collide w.p. ≥ P1; if ‖p − p′‖ ≥ cr, they collide w.p. ≤ P2, for P1 ≥ P2
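As a concrete illustration of such a scheme, the sketch below uses bit-sampling LSH for Hamming distance (the classic family from [IM98]): each g_i reads k random coordinates, so two points at Hamming distance t collide with probability (1 − t/dim)^k, which is higher the closer they are. All names and parameter values here are illustrative.

```python
import random

def make_hash(dim, k, rng):
    """One bit-sampling LSH function for Hamming space: g(x) is the tuple of
    k randomly chosen coordinates of x; points at Hamming distance t collide
    with probability (1 - t/dim)^k."""
    coords = [rng.randrange(dim) for _ in range(k)]
    return lambda x: tuple(x[i] for i in coords)

rng = random.Random(0)
dim, k, L = 16, 4, 8
hashes = [make_hash(dim, k, rng) for _ in range(L)]   # g_1, ..., g_L

p     = [0] * dim
close = [0] * dim; close[3] = 1    # Hamming distance 1 from p
far   = [1] * dim                  # Hamming distance 16 from p

close_collisions = sum(g(p) == g(close) for g in hashes)
far_collisions   = sum(g(p) == g(far) for g in hashes)
print(close_collisions, far_collisions)  # far pair never collides here; the close pair usually does
```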
Retrieval: [Indyk, Motwani '98]
• The union of the query buckets, ∪_i B_i(g_i(q)), is roughly the neighborhood of q
• How to report a uniformly random neighbor from the union of these buckets?
• Collecting all the points might take O(n) time
Approaches
How to output a random neighbor from ∪_i B_i(g_i(q)):

Approach 1: Uniform/Uniform
1. Choose a uniformly random bucket
2. Choose a uniformly random point in the bucket
Approach 2: Weighted/Uniform
1. Choose a random bucket proportional to its size
2. Choose a random point in the bucket
• Each point p in the neighborhood is picked w.p. proportional to its degree D_p, the number of buckets that p appears in
Approach 3: Optimal
1. Choose a random bucket proportional to its size
2. Choose a random point in the bucket (each point p in the neighborhood is picked w.p. proportional to its degree D_p)
3. Keep p with probability 1/D_p, otherwise repeat
• This yields uniform probability
• But we need to spend O(L) time to find the degree
• Might need O(D_p) = O(L) samples, so the total time is O(L²)
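A minimal sketch of this rejection-sampling loop, under the assumption that we can afford the O(L) scan to compute the exact degree; the function name and the toy buckets are illustrative:

```python
import random

def fair_sample(buckets, rng):
    """Approach 3: pick a bucket proportional to its size, then a uniform
    point in it (so p is drawn w.p. proportional to its degree D_p), then
    accept with probability 1/D_p. Accepted points are uniform on the union."""
    sizes = [len(b) for b in buckets]
    while True:
        bucket = rng.choices(buckets, weights=sizes)[0]
        p = rng.choice(sorted(bucket))          # uniform point in the bucket
        degree = sum(p in b for b in buckets)   # exact D_p: an O(L) scan
        if rng.random() < 1.0 / degree:
            return p

# three overlapping "query buckets"; point 3 has degree 3, point 1 degree 1
buckets = [{1, 2, 3}, {2, 3}, {3, 4}]
rng = random.Random(1)
counts = {x: 0 for x in (1, 2, 3, 4)}
for _ in range(20000):
    counts[fair_sample(buckets, rng)] += 1
print(counts)  # all four points appear roughly equally often despite unequal degrees
```

Without step 3, point 3 (degree 3) would be reported three times as often as point 1, which is exactly the bias Approach 2 suffers from.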
Approximate the degree D_p
• Sample O(L / (D_p · ε²)) buckets out of the L buckets to (1 + ε)-approximate the degree
• Still, if the degree is low, this takes O(L) samples

Case 1: small degree D_p:
• More samples are required to estimate
• Reject with lower probability -> fewer queries of this type
Case 2: large degree D_p:
• Fewer samples are required to estimate
• Reject with higher probability -> more queries of this type

• This decreases the O(L²) runtime to Õ(L)
• Large dependency on ε, of the form O(1/ε²)
• Via a different sampling approach we show how to reduce the dependency to logarithmic, O(log(1/ε))
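The estimator itself can be sketched as follows: probe random buckets and scale up the hit count. This toy version samples with replacement for simplicity; the O(L / (D_p · ε²)) sample bound follows from standard concentration arguments. The function name and the test data are illustrative.

```python
import random

def estimate_degree(p, buckets, m, rng):
    """Estimate D_p by probing m buckets chosen uniformly at random:
    D_hat = L * (hits / m). The estimator is unbiased, and its relative
    accuracy improves as m grows compared to L / D_p."""
    L = len(buckets)
    hits = sum(p in rng.choice(buckets) for _ in range(m))
    return L * hits / m

L = 100
buckets = [{0} if i < 40 else set() for i in range(L)]  # point 0 has true degree 40
rng = random.Random(0)
est = estimate_degree(0, buckets, m=2000, rng=rng)
print(est)  # close to the true degree 40
```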
Experiments
Setup
• Take MNIST as the data set
• Ask a query several times and compute the empirical distribution of the neighbors
• Compute the statistical distance of the empirical distribution to the uniform distribution
Comparison
• Our algorithm performs 2.5 times worse than the optimal algorithm, while the other two perform 7 and 10 times worse than the optimal
• Our algorithm is four times faster than the optimal, but 15 times slower than the other two
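The statistical (total variation) distance used in this setup can be computed as below; the counts are a toy stand-in for the empirical neighbor distribution measured on MNIST:

```python
def total_variation(counts, support_size):
    """Statistical (total variation) distance between the empirical
    distribution given by `counts` and the uniform distribution on
    a support of `support_size` points."""
    total = sum(counts.values())
    uniform = 1.0 / support_size
    return 0.5 * sum(abs(counts.get(x, 0) / total - uniform)
                     for x in range(support_size))

# toy empirical counts over a 4-point neighborhood
counts = {0: 30, 1: 25, 2: 25, 3: 20}
print(total_variation(counts, 4))  # ~0.05: half the total deviation from uniform
```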
Conclusion

Domain                                            Guarantee        Space       Query
Exact neighborhood B(q, r)                        w.h.p.           O(S_ANN)    Õ(Q_ANN · |B(q, cr)| / |B(q, r)|)
Approximate neighborhood B(q, r) ⊆ S ⊆ B(q, cr)   In expectation   O(S_ANN)    Õ(Q_ANN)

• The dependence on the parameter c matches the standard Near Neighbor problem.
• We get an independent near neighbor each time we draw a sample.
• More generally, the approach works for sampling from a sub-collection of sets.
Open Problem:
o Finding the optimal dependency on the density parameter |B(q, cr)| / |B(q, r)|

Thanks! Questions?