A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces
Group 4: Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee
Contents
• Introduction
• Observations
• Analysis of NN-search
• VA-file
• Conclusion
Presenter: Seokhwan Eom
The Similarity Search Paradigm

Locate the closest point to the query object, i.e. its nearest neighbor (NN).

( Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )
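As a concrete baseline, the sequential scan against which all of the following methods are measured can be sketched in a few lines of Python (an illustrative sketch; the function name and data layout are ours):

    import math

    # Baseline NN-search: a sequential scan over all data points.
    def nn_search(data, query):
        best, best_dist = None, float("inf")
        for point in data:
            d = math.dist(point, query)      # Euclidean distance (Python 3.8+)
            if d < best_dist:
                best, best_dist = point, d
        return best, best_dist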
The conventional approach

• Space-partitioning methods
  - Grid file [Nievergelt:1984]
  - K-D-B-tree [Robinson:1981]
  - Quad tree [Finkel:1974]
• Data-partitioning index trees
  - R-tree [Guttman:1984]
  - R+-tree [Sellis:1987]
  - R*-tree [Beckmann:1990]
  - X-tree [Berchtold:1996]
  - SR-tree [Katayama:1997]
  - M-tree [Ciaccia:1996]
  - TV-tree [Lin:1994]
  - hB-tree [Lomet:1990]

Unfortunately, as the number of dimensions increases, the performance of all of these methods degrades: the "dimensional curse".
Contribution
• Assumptions: initially uniformly-distributed data within the unit hypercube, with independent dimensions

1. Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures.
2. Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all of its blocks if the number of dimensions is sufficiently large.
3. Present performance results which support the analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than around 6.
The Difficulties of High Dimensionality
• Observation 1 (Number of partitions)
A simple partitioning scheme: split the data space in each dimension into two halves.
This seems reasonable in low dimensions.
But with d = 100 there are $2^{100} \approx 10^{30}$ partitions;
even with $10^6$ points, almost all of the $10^{30}$ partitions are empty (at most one in $10^{24}$ can contain a point).
• Observation 2 (Data space is sparsely populated)
Consider a hyper-cube range query with side length s = 0.95 in the data space $\Omega = [0,1]^d$.

Figure: the target region is a sub-cube with side length s in each dimension.

The probability that a uniformly-distributed point falls into the query region is
$P[s^d] = s^d$; at d = 100, $0.95^{100} \approx 0.0059$.
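This number is easy to verify (a one-line check):

    # P[point falls into a hyper-cube query with side s] = s^d
    s, d = 0.95, 100
    print(s ** d)   # ~0.0059: even a 95%-per-dimension query selects ~0.6% of the space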
• Observation 3 (Spherical range queries): the probability that an arbitrary point lies within the largest spherical query $sp^d(Q, 0.5)$ fitting entirely inside Ω is its volume,
  $Vol(sp^d(0.5)) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \cdot 0.5^d$
Table: Probability that a point lies within the largest range query inside Ω, and the expected database size
Figure: Largest range query entirely within the data space.
• Observation 4 (Exponentially growing DB size): the size a data set would have to have such that, on average, at least one point falls into the sphere $sp^d(Q, 0.5)$ is (for even d)
  $N(d) = \frac{1}{Vol(sp^d(Q, 0.5))} = \frac{(d/2)! \cdot 2^d}{\pi^{d/2}}$
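Both observations follow from the standard formula for the volume of a d-dimensional ball; a small sketch (ball_volume is our own helper) reproduces the trend of the table:

    import math

    # Volume of a d-dimensional ball of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d
    def ball_volume(d, r):
        return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

    for d in (2, 10, 20, 100):
        p = ball_volume(d, 0.5)   # P[point lies in the largest sphere inside [0,1]^d]
        print(d, p, 1 / p)        # 1/p: expected DB size for one hit on average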
• Observation 5 (Expected NN-distance)
The probability that the NN-distance is at most r (i.e. the probability that the NN of query point Q is contained in $sp^d(Q, r)$):
  $P[nn_{dist}(Q) \le r] = 1 - (1 - Vol(sp^d(Q, r) \cap \Omega))^N$

The expected NN-distance for a query point Q:
  $E[nn_{dist}(Q)] = \int_0^{\sqrt{d}} (1 - Vol(sp^d(Q, r) \cap \Omega))^N \, dr$

The expected NN-distance $E[nn_{dist}]$ for any query point in the data space:
  $E[nn_{dist}] = \int_{\Omega} E[nn_{dist}(Q)] \, dQ$
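The integrals have no simple closed form, but the quantity is easy to estimate by simulation (a Monte Carlo sketch under the paper's uniformity assumption; expected_nn_distance and the parameter values are ours):

    import math, random

    # Monte Carlo estimate of E[nndist] for n uniform points in [0,1]^d.
    def expected_nn_distance(d, n, trials=20):
        total = 0.0
        for _ in range(trials):
            query = [random.random() for _ in range(d)]
            total += min(math.dist(query, [random.random() for _ in range(d)])
                         for _ in range(n))
        return total / trials

    for d in (2, 10, 50, 100):
        print(d, expected_nn_distance(d, n=1000))   # grows steadily with d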
Two consequences:
1. The NN-distance grows steadily with d.
2. Beyond trivially-small data sets D, the NN-distance decreases only marginally as the size of D increases.
Analysis of NN-Search
• The complexity of any partitioning and clustering scheme converges to $O(N)$ with increasing dimensionality
• General Cost Model
• Space-Partitioning Methods
• Data-Partitioning Methods
• General Partitioning and Clustering Schemes
Presenter: Jungyeol Lee
General Cost Model
• 'Cost' of a query: the number of blocks which must be accessed
• Optimal NN-search algorithm: the blocks visited during the search are exactly those whose MBR 1) intersects the NN-sphere
1) MBR: Minimum Bounding Regions
• Let $M_{visit}$ be the number of blocks visited
• $M_{visit}$ = the number of blocks $mbr_i$ which intersect the NN-sphere $sp^d(Q, E[nn_{dist}])$
• Transform the spherical query into a point query: the Minkowski sum $MSum(mbr_i, E[nn_{dist}])$ enlarges $mbr_i$ by the NN-distance $E[nn_{dist}]$ in every direction
• Probability that the i-th block must be visited:
  $P_{visit}[i] = Vol(MSum(mbr_i, E[nn_{dist}]) \cap \Omega)$
• Expected number of block accesses, with m points per block:
  $M_{visit} = \frac{N}{m} \cdot P^{avg}_{visit}$, where $P^{avg}_{visit} = \frac{m}{N} \sum_{i=0}^{N/m - 1} P_{visit}[i]$
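In code, the visit test and the per-block probability look roughly as follows (a sketch with our own helper names, assuming a rectangular MBR; the Ω-intersection is handled by sampling queries uniformly in the unit cube):

    import random

    # A block must be visited iff the query lies in MSum(mbr, r),
    # i.e. iff the minimum distance from the query to the MBR is <= r.
    def must_visit(lo, hi, query, r):
        mindist2 = sum(max(l - q, q - h, 0.0) ** 2
                       for l, h, q in zip(lo, hi, query))
        return mindist2 <= r * r

    # Monte Carlo estimate of P_visit[i] = Vol(MSum(mbr_i, r) ∩ Ω).
    def p_visit(lo, hi, r, d, samples=100_000):
        hits = sum(must_visit(lo, hi, [random.random() for _ in range(d)], r)
                   for _ in range(samples))
        return hits / samples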
Space-Partitioning Methods
• The space is divided regardless of clusters in the data
• If each dimension is split once, the total number of partitions is $2^d$ and the space overhead is $O(2^d)$
• To reduce the space overhead, only $d' < d$ dimensions are split such that, on average, m points are assigned to a partition:
  $2^{d'} = \frac{N}{m}$, i.e. $d' = \log_2 \frac{N}{m}$
• Let $l_{max}$ denote the maximum distance from a block $mbr_i$ to any point in the data space; only the $d'$ split dimensions contribute, so
  $l_{max} = \frac{1}{2}\sqrt{d'} = \frac{1}{2}\sqrt{\log_2 \frac{N}{m}}$, independent of d
• Since $E[nn_{dist}]$ grows with d, there is some dimensionality at which $l_{max} \le E[nn_{dist}]$
• From that dimensionality on, the Minkowski sum of every block covers the entire data space
• $P_{visit}$ converges to 1, the same as a sequential scan
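A quick computation shows how small $l_{max}$ stays (a sketch; the values of N and m are illustrative):

    import math

    # l_max = 0.5 * sqrt(log2(N/m)) is constant in d, while NN-distances in
    # [0,1]^d grow on the order of sqrt(d) (cf. the estimate sketched earlier).
    def l_max(n, m):
        return 0.5 * math.sqrt(math.log2(n / m))

    print(l_max(10**6, 100))   # ~1.82
    print(l_max(10**9, 100))   # ~2.41: even a thousandfold larger DB barely helps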
• Beyond that dimensionality, every block must be visited:
  $P_{visit}[i] = Vol(MSum(mbr_i, E[nn_{dist}]) \cap \Omega) = 1$
• Fig. 7: Comparison of $l_{max}$ with $E[nn_{dist}]$
Data-Partitioning Methods
• Data-partitioning methods partition the data space hierarchically, in order to reduce the search cost from $O(N)$ to $O(\log N)$
• In practice, existing methods are unsuited to NN-search in HDVSs: a simple sequential scan out-performs these more sophisticated hierarchical methods
Presenter: Rina You
Rectangular MBRs
• Index methods of this group use hyper-rectangles to bound the region of a block
• Splitting a node results in two new, equally-full partitions of the data space
• At high dimensionality, only $d' = \log_2 \frac{N}{m}$ dimensions are split
• The rectangular MBR then has
  - d' sides with a length of 1/2
  - d - d' sides with a length of 1
• The probability of visiting a block during NN-search: the volume of that part of the extended box lying within the data space
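A Monte Carlo sketch of this probability for such a block (anchoring the block at the origin is our simplifying assumption; p_visit_rect is our own name):

    import random

    # P_visit for an MBR with d' sides of length 1/2 and d - d' sides of
    # length 1, extended by radius r; queries are sampled uniformly in [0,1]^d.
    def p_visit_rect(d, d_split, r, samples=50_000):
        hits = 0
        for _ in range(samples):
            q = [random.random() for _ in range(d)]
            # only the split dimensions can lie outside the box
            dist2 = sum(max(qi - 0.5, 0.0) ** 2 for qi in q[:d_split])
            hits += dist2 <= r * r
        return hits / samples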
• Figure: the probability of accessing a block during NN-search, for different database sizes and different values of d'
Spherical MBRs
• Another group of index structures uses MBRs in the form of hyper-spheres
• Each block of an optimal such structure consists of
  - a center point C
  - the m - 1 nearest neighbors of C
• The MBR can therefore be described by the sphere $sp(C, nn_{dist,m-1}(C))$
• The Minkowski sum of a spherical MBR is again a sphere, enlarged by the NN-distance:
  $MSum(sp(C, nn_{dist,m-1}(C)), E[nn_{dist}]) = sp^d(C, nn_{dist,m-1}(C) + E[nn_{dist}])$
• The probability that block i must be visited during a NN-search:
  $P^{sp}_{visit}[i] = Vol(sp^d(C_i, nn_{dist,m-1}(C_i) + E[nn_{dist}]) \cap \Omega)$
• A lower bound for this probability is obtained by replacing $nn_{dist,m-1}$ by $nn_{dist,1}$:
  - $nn_{dist,i}$ does not decrease as i increases: $i \le j \Rightarrow nn_{dist,i} \le nn_{dist,j}$
  - and $E[nn_{dist,1}] = E[nn_{dist}]$
• Hence:
  $P^{sp}_{visit}[i] \ge Vol(sp^d(C_i, 2 \cdot E[nn_{dist}]) \cap \Omega)$
• The average probability of accessing a block during the search: average the above bound over all center points C:
  $P^{sp,avg}_{visit} \ge \int_{\Omega} Vol(sp^d(C, 2 \cdot E[nn_{dist}]) \cap \Omega) \, dC$
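This bound can likewise be estimated numerically (a sketch; both the centers C and the clipped sphere volumes are sampled uniformly in the unit cube, and r stands for $2 \cdot E[nn_{dist}]$):

    import math, random

    # Average of Vol(sp(C, r) ∩ Ω) over uniform centers C.
    def p_visit_sphere_bound(d, r, centers=200, samples=2000):
        total = 0.0
        for _ in range(centers):
            c = [random.random() for _ in range(d)]
            inside = sum(math.dist(c, [random.random() for _ in range(d)]) <= r
                         for _ in range(samples))
            total += inside / samples
        return total / centers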
• The percentage of blocks visited increases rapidly with the dimensionality
• A sequential scan will therefore perform better in practice
General Partitioning and Clustering Schemes
• No partitioning or clustering scheme can offer efficient NN-search if the number of dimensions becomes large
• The complexity of such methods is $O(N)$
• A large portion (up to 100%) of the data blocks must be read in order to determine the nearest neighbor
• Basic assumptions:
1. A cluster is a geometrical form (MBR) that covers all cluster points
2. Each cluster contains at least two points
3. The MBR of a cluster is convex.
• Average probability of accessing a cluster during an NN-search (over l clusters):
  $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i))$, where $VM(x) = Vol(MSum(x, E[nn_{dist}]) \cap \Omega)$
• Lower-bound the average probability of accessing a cluster by that of a line cluster:
  $VM(mbr(C_i)) \ge VM(line(A_i, B_i))$
• Pick two arbitrary data points $A_i$ and $B_i$: each cluster contains at least two points
• $line(A_i, B_i)$ is contained in $mbr(C_i)$: $mbr(C_i)$ is convex
• This lower-bounds the volume of the extended $mbr(C_i)$
• Lower-bound the distance between $A_i$ and $B_i$:
  $VM(line(A_i, B_i)) \ge VM(line(A_i, P_i)) = \min_{Q \in surf(nn^{sp}(A_i))} VM(line(A_i, Q))$, with $P_i \in surf(nn^{sp}(A_i))$
  - points on the surface of the NN-sphere of $A_i$ have minimal Minkowski sum for $line(A_i, \cdot)$
  - $line(A_i, P_i)$ is thus the optimal line cluster for point $A_i$, where $P_i$ is a point on the surface of the NN-sphere of $A_i$
• Lower-bound the average probability of accessing the line clusters: calculate the average volume of the Minkowski sums over all possible pairs A and P(A) in the data space:
  $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i)) \ge \int_{\Omega} VM(line(A, P(A))) \, dA$
• Conclusion 1 (Performance): for any clustering and partitioning method, a simple sequential scan performs better if the number of dimensions exceeds some d
• Conclusion 2 (Complexity): the complexity of any clustering and partitioning method tends towards O(N) as dimensionality increases
• Conclusion 3 (Degeneration): all blocks are accessed if the number of dimensions exceeds some d
The VA-file
• Accelerates the unavoidable sequential scan by using object approximations to compress the vector data
• Reduces the amount of data that must be read during similarity searches.
• Compressing vector data
• The filtering step
• Accessing the data
Presenter: Kilho Lee
The VA-file: Compressing vector data

• For each dimension i, a small number of bits $b_i$ is assigned
• Let b be the sum of all $b_i$'s: $b = \sum_{i=1}^{d} b_i$
• The data space is divided into $2^b$ cells
• The probability that a point lies in a given cell:
  $P["in\_cell"] = Vol(cell) = \prod_{i=1}^{d} \left(\frac{1}{2}\right)^{b_i} = 2^{-b}$
• The probability that a cell is shared by more than one of the N points:
  $P[Share^b_N] = 1 - (1 - 2^{-b})^{N-1}$
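The approximation itself is plain uniform quantization; a minimal sketch (function and variable names are ours, not the paper's implementation):

    # Quantize each coordinate of a vector in [0,1]^d to b_i bits,
    # yielding the grid-cell index that serves as the approximation.
    def approximate(vector, bits_per_dim):
        return [min(int(x * 2**b), 2**b - 1)
                for x, b in zip(vector, bits_per_dim)]

    print(approximate([0.12, 0.97, 0.50], bits_per_dim=[6, 6, 6]))   # [7, 62, 32]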
The VA-file: Filtering step

• When searching for the nearest neighbor, the entire approximation file is scanned, and upper and lower bounds on the distance to the query are derived from each approximation
• Let δ be the smallest upper bound found so far; if an approximation's lower bound exceeds δ, the corresponding vector is filtered out
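The filtering loop can be sketched as follows (lower_bound and upper_bound stand in for the per-dimension bound computations of a real VA-file; the structure of the scan is what matters):

    # Scan all approximations, keeping only those whose lower bound does not
    # exceed delta, the smallest upper bound seen so far.
    def filter_step(approximations, lower_bound, upper_bound):
        delta = float("inf")
        candidates = []
        for i, a in enumerate(approximations):
            lb = lower_bound(a)
            if lb <= delta:                       # cannot be pruned yet
                candidates.append((lb, i))
                delta = min(delta, upper_bound(a))
        return candidates, delta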
• After the filtering step, less than 0.1% of the vectors remain as candidates
The VA-file: Accessing the vectors

• After the filtering step, a small set of candidates remains
• The candidates are visited in increasing order of their lower bounds, and their exact vectors are read
• If a lower bound is encountered that exceeds the smallest exact distance seen so far, the VA-file method stops
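A sketch of this access phase (candidates are the (lower bound, id) pairs from the filtering step; math.dist stands in for the exact distance computed after reading the vector from disk):

    import math

    # Visit candidates in ascending lower-bound order; stop as soon as a
    # lower bound exceeds the smallest exact distance found so far.
    def access_step(candidates, vectors, query):
        best_dist, best_id = float("inf"), None
        for lb, i in sorted(candidates):
            if lb > best_dist:        # no remaining candidate can be closer
                break
            d = math.dist(query, vectors[i])
            if d < best_dist:
                best_dist, best_id = d, i
        return best_id, best_dist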
• In practice, less than 1% of the vector blocks are visited
• For d = 50, $b_i$ = 6 and N = 500,000, only 20 vectors are accessed
Performance
• The figure depicts the percentage of blocks visited.
Conclusion
• Conventional indexing methods are out-performed by a simple sequential scan already at moderate dimensionality (d ≈ 10)
• At moderate and high dimensionality (d ≥ 6), the VA-file method can out-perform any other method