24
Indexing inexact proximity search with distance regression in pivot space Ole Edsberg Magnus Lie Hetland Department of Computer and Information Science Norwegian University of Science and Technology Similarity Search and Applications, 2010

Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Indexing inexact proximity search withdistance regression in pivot space

Ole Edsberg Magnus Lie Hetland

Department of Computer and Information ScienceNorwegian University of Science and Technology

Similarity Search and Applications, 2010

Page 2: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

The problem

Our task: Speed up proximity search in cases where:I Distance calculation is expensive.I Distance-based indexing is needed, because the contents

of the data objects cannot be used in the index.I Some search inexactness is acceptable, meaning we are

allowed to trade som search accuracy in return for reducedsearch computation cost.

Our contribution: A new indexing scheme that in some casesprovides better computation / accuracy trade-offs than thecompetition. It also has some draw-backs.

Page 3: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Related work

E. Chávez, K. Figueroa, and G. Navarro.Effective proximity retrieval by ordering permutations.IEEE Transactions on Pattern Analysis and MachineIntelligence, 30(9):1647–1658, 2008.

Takeaway:I How to perform inexact search by ordering the database

according to promise value function.I A specific promise value function, which we use as a

baseline in our experiments.I Experimental setup.

Page 4: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Pivot space

Pivot set:P = (p1,p2, . . . ,pm)

Mapping object o to pivot space.

Φ(o) = (d(o,p1),d(o,p2), . . . ,d(o,pm))

Page 5: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

The baseline: Permutation based promise values

The promise value for indexed object u with respect to query qis the correlation (rank correlation coefficient, Spearman’s ρ)between the ordering permutation of Φ(u) and that of Φ(q).

Page 6: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

What makes a good promise value function?

Page 7: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Two ideas for promise value functions

Distance estimated̂(u,q)

Probability of inclusion

Pr(d(u,q) ≤ r)

Page 8: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Uncertainty in distance estimates

d̂(u,q)

distance to query

prob

abili

tyde

nsity

Page 9: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Should red or blue object be visited first?

r

distance to query

prob

abili

tyde

nsity

Page 10: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Should red or blue object be visited first?

r

distance to query

prob

abili

tyde

nsity

Page 11: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Linear model of distance for indexed object u

d(u,q) = β〈u,0〉 +m∑

i=1

β〈u,i〉d(q,pi) + εu ,

Page 12: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Regression-based index (one model per object!)

With n objects to index and m pivots, an n × (m + 1) matrix:β̂〈u1,0〉, β̂〈u1,1〉, . . . , β̂〈u1,m〉β̂〈u2,0〉, β̂〈u2,1〉, . . . , β̂〈u2,m〉

. . .

β̂〈un,0〉, β̂〈un,1〉, . . . , β̂〈un,m〉

I The coefficients can be discretized to save space.I Plus 2n + 2 additional values if we want probabilities.

Page 13: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Building the index

1. Select n′ training queries from the objects to be indexed.2. For each training query q′, calculate Φ(q′) .3. For each object to be indexed u:

3.1 For each training query q′, calculate d(u,q′).3.2 Solve the least squares linear regression problem to find

the m + 1 coefficients βu.3.3 Store the coefficients in the index.3.4 If we want probabilities, also store σ̂u, the estimated

standard deviation of εu:

σ̂u =

√∑n′

i=1(d(u,q′i )− d̂(u,q′

i ))2

n′ −m − 1

4. If we want probabilities for the k -NN queries, also store theestimated search radius for each k .

(Detail glossed over in this presentation: we exclude u from thetraining queries used to fit its own model.)

Page 14: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Distance estimates as promise values

d̂(u,q) = β̂〈u,0〉 +m∑

i=1

β̂〈u,i〉d(q,pi)

Page 15: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Probability-based promise values

r − d̂(u,q)

σ̂u

I Depends on a lot of assumptions.I We also ignore the consequence of excluding u from its

own training queries.

Page 16: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Storage costs

With n objects to index and m pivots,I For distance estimates: n(m + 1) coefficients. (Can be

discretized at the cost of some accuracy.)I For probabilities: 2n + 2 additional values.

I Permutation-based index: nmdlog2(m)e bits in total.

Page 17: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Index building costs

With n objects to index, m pivots and n′ training queries,I Regression-based scheme: n′(n + m) distance calculations

plus the solution of n linear regression problems.I Permutation-based scheme: nm distance calculations, plus

some sorting.

Page 18: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Experimental setup

I We borrowed the experimental setup from Chávez,Figueroa & Navarro’s evaluation of the permutation-basedscheme.

I Pivots selected randomly.I Also evaluated versions with pivot set reduced to make

storage cost equal to permutation-based index.I Both synthetic and real-world data sets, but results on

real-world data may have more validity.

Page 19: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Evaluating promise value functions:computation / accuracy trade-offs

Computation cost (%)

Sea

rch

accu

racy

(%)

Perfect orderingsHypothetical AHypothetical B

Random orderings

(Average over many queries.)

Page 20: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Results on normalized edit distance (NED)

10−1 100 101 1020

50

100

% of max dist. calc.

%re

trie

ved

(a) 32 pivots

10−1 100 101 102

20

40

60

80

100

% of max dist. calc.

%re

trie

ved

(b) 128 pivots

Permutations Distance estimates (16 bits unfair)Probabilities (16 bits unfair) Distance estimates (16 bits fair)

Distance estimates (12 bits fair)

Page 21: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Results on documents (TREC)

100 101 102

40

60

80

100

% of max dist. calc.

%re

trie

ved

(a) Range-queries

0 0.1 0.20

50

100

wall clock time (seconds)

%re

trie

ved

(b) 5-NN queries

Permutations Distance estimates (16 bits unfair)Probabilities (16 bits unfair) Distance estimates (16 bits fair)

Distance estimates (12 bits fair)

Page 22: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Results on face images (FERET)

0 10 200

0.1

0.2

k%of

max

dist

ance

calc

ulat

ions (a)

0 10 20

1

2

3·10−3

k

wal

lclo

cktim

e(s

econ

ds) (b)

Permutations Distance estimates (16 bits unfair)Probabilities (16 bits unfair) Distance estimates (16 bits fair)

Distance estimates (12 bits fair)

Page 23: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Why were the probability-based promise values sometimesworse, and never better, than the distance estimates?

Page 24: Indexing inexact proximity search with distance regression ... · I Distance-based indexing is needed, because the contents of the data objects cannot be used in the index. I Some

Conclusion

Regression-based scheme show some promise, but:I Takes a lot of time to build the index.I Vulnerable to deviation from assumptions.