A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces
Group 4: Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee
Contents
• Introduction
• Observations
• Analysis of NN-search
• VA-file
• Conclusion
Presenter: Seokhwan Eom
The Similarity Search Paradigm

Locate the closest point to the query object, i.e. its nearest neighbor (NN).

( Reference: What's wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )
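As a concrete baseline, the sequential scan against which all of the following methods are measured can be sketched in a few lines of Python (an illustrative sketch; the function name and data layout are ours):

    import math

    # Baseline NN-search: a sequential scan over all data points.
    def nn_search(data, query):
        best, best_dist = None, float("inf")
        for point in data:
            d = math.dist(point, query)      # Euclidean distance (Python 3.8+)
            if d < best_dist:
                best, best_dist = point, d
        return best, best_dist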
The conventional approach

• Space-partitioning methods
  - Grid file [Nievergelt:1984]
  - K-D-B-tree [Robinson:1981]
  - Quad tree [Finkel:1974]
• Data-partitioning index trees
  - R-tree [Guttman:1984]
  - R+-tree [Sellis:1987]
  - R*-tree [Beckmann:1990]
  - X-tree [Berchtold:1996]
  - SR-tree [Katayama:1997]
  - M-tree [Ciaccia:1996]
  - TV-tree [Lin:1994]
  - hB-tree [Lomet:1990]

Unfortunately, as the number of dimensions increases, the performance of all of these methods degrades: the "dimensional curse".
Contribution
• Assumptions: initially uniformly-distributed data within the unit hypercube, with independent dimensions

1. Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures.
2. Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all of its blocks if the number of dimensions is sufficiently large.
3. Present performance results which support the analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than around 6.
The Difficulties of High Dimensionality
• Observation 1 (Number of partitions)
A simple partitioning scheme: split the data space in each dimension into two halves.
This seems reasonable in low dimensions.
But with d = 100 there are $2^{100} \approx 10^{30}$ partitions;
even with $10^6$ points, almost all of the $10^{30}$ partitions are empty (at most one in $10^{24}$ can contain a point).
• Observation 2 (Data space is sparsely populated)
Consider a hyper-cube range query with side length s = 0.95 in the data space $\Omega = [0,1]^d$.

Figure: the target region is a sub-cube with side length s in each dimension.

The probability that a uniformly-distributed point falls into the query region is
$P[s^d] = s^d$; at d = 100, $0.95^{100} \approx 0.0059$.
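This number is easy to verify (a one-line check):

    # P[point falls into a hyper-cube query with side s] = s^d
    s, d = 0.95, 100
    print(s ** d)   # ~0.0059: even a 95%-per-dimension query selects ~0.6% of the space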
• Observation 3 (Spherical range queries): the probability that an arbitrary point lies within the largest spherical query $sp^d(Q, 0.5)$ fitting entirely inside Ω is its volume,
  $Vol(sp^d(0.5)) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \cdot 0.5^d$
Table: Probability that a point lies within the largest range query inside Ω, and the expected database size
Figure: Largest range query entirely within the data space.
• Observation 4 (Exponentially growing DB size): the size a data set would have to have such that, on average, at least one point falls into the sphere $sp^d(Q, 0.5)$ is (for even d)
  $N(d) = \frac{1}{Vol(sp^d(Q, 0.5))} = \frac{(d/2)! \cdot 2^d}{\pi^{d/2}}$
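Both observations follow from the standard formula for the volume of a d-dimensional ball; a small sketch (ball_volume is our own helper) reproduces the trend of the table:

    import math

    # Volume of a d-dimensional ball of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d
    def ball_volume(d, r):
        return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

    for d in (2, 10, 20, 100):
        p = ball_volume(d, 0.5)   # P[point lies in the largest sphere inside [0,1]^d]
        print(d, p, 1 / p)        # 1/p: expected DB size for one hit on average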
• Observation 5 (Expected NN-distance)
The probability that the NN-distance is at most r (i.e. the probability that the NN of query point Q is contained in $sp^d(Q, r)$):
  $P[nn_{dist}(Q) \le r] = 1 - (1 - Vol(sp^d(Q, r) \cap \Omega))^N$

The expected NN-distance for a query point Q:
  $E[nn_{dist}(Q)] = \int_0^{\sqrt{d}} (1 - Vol(sp^d(Q, r) \cap \Omega))^N \, dr$

The expected NN-distance $E[nn_{dist}]$ for any query point in the data space:
  $E[nn_{dist}] = \int_{\Omega} E[nn_{dist}(Q)] \, dQ$
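The integrals have no simple closed form, but the quantity is easy to estimate by simulation (a Monte Carlo sketch under the paper's uniformity assumption; expected_nn_distance and the parameter values are ours):

    import math, random

    # Monte Carlo estimate of E[nndist] for n uniform points in [0,1]^d.
    def expected_nn_distance(d, n, trials=20):
        total = 0.0
        for _ in range(trials):
            query = [random.random() for _ in range(d)]
            total += min(math.dist(query, [random.random() for _ in range(d)])
                         for _ in range(n))
        return total / trials

    for d in (2, 10, 50, 100):
        print(d, expected_nn_distance(d, n=1000))   # grows steadily with d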
Two consequences:
1. The NN-distance grows steadily with d.
2. Beyond trivially-small data sets D, the NN-distance decreases only marginally as the size of D increases.
Analysis of NN-Search
• The complexity of any partitioning and clustering scheme converges to $O(N)$ with increasing dimensionality
• General Cost Model
• Space-Partitioning Methods
• Data-Partitioning Methods
• General Partitioning and Clustering Schemes
Presenter: Jungyeol Lee
General Cost Model
• 'Cost' of a query: the number of blocks which must be accessed
• Optimal NN-search algorithm: the blocks visited during the search are exactly those whose MBR 1) intersects the NN-sphere
1) MBR: Minimum Bounding Regions
• Let $M_{visit}$ be the number of blocks visited
• $M_{visit}$ = the number of blocks $mbr_i$ which intersect the NN-sphere $sp^d(Q, E[nn_{dist}])$
• Transform the spherical query into a point query: the Minkowski sum $MSum(mbr_i, E[nn_{dist}])$ enlarges $mbr_i$ by the NN-distance $E[nn_{dist}]$ in every direction
• Probability that the i-th block must be visited:
  $P_{visit}[i] = Vol(MSum(mbr_i, E[nn_{dist}]) \cap \Omega)$
• Expected number of block accesses, with m points per block:
  $M_{visit} = \frac{N}{m} \cdot P^{avg}_{visit}$, where $P^{avg}_{visit} = \frac{m}{N} \sum_{i=0}^{N/m - 1} P_{visit}[i]$
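In code, the visit test and the per-block probability look roughly as follows (a sketch with our own helper names, assuming a rectangular MBR; the Ω-intersection is handled by sampling queries uniformly in the unit cube):

    import random

    # A block must be visited iff the query lies in MSum(mbr, r),
    # i.e. iff the minimum distance from the query to the MBR is <= r.
    def must_visit(lo, hi, query, r):
        mindist2 = sum(max(l - q, q - h, 0.0) ** 2
                       for l, h, q in zip(lo, hi, query))
        return mindist2 <= r * r

    # Monte Carlo estimate of P_visit[i] = Vol(MSum(mbr_i, r) ∩ Ω).
    def p_visit(lo, hi, r, d, samples=100_000):
        hits = sum(must_visit(lo, hi, [random.random() for _ in range(d)], r)
                   for _ in range(samples))
        return hits / samples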
Space-Partitioning Methods
• The space is divided regardless of clusters in the data
• If each dimension is split once, the total number of partitions is $2^d$ and the space overhead is $O(2^d)$
• To reduce the space overhead, only $d' < d$ dimensions are split such that, on average, m points are assigned to a partition:
  $2^{d'} = \frac{N}{m}$, i.e. $d' = \log_2 \frac{N}{m}$
• Let $l_{max}$ denote the maximum distance from a block $mbr_i$ to any point in the data space; only the $d'$ split dimensions contribute, so
  $l_{max} = \frac{1}{2}\sqrt{d'} = \frac{1}{2}\sqrt{\log_2 \frac{N}{m}}$, independent of d
• Since $E[nn_{dist}]$ grows with d, there is some dimensionality at which $l_{max} \le E[nn_{dist}]$
• From that dimensionality on, the Minkowski sum of every block covers the entire data space
• $P_{visit}$ converges to 1, the same as a sequential scan
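A quick computation shows how small $l_{max}$ stays (a sketch; the values of N and m are illustrative):

    import math

    # l_max = 0.5 * sqrt(log2(N/m)) is constant in d, while NN-distances in
    # [0,1]^d grow on the order of sqrt(d) (cf. the estimate sketched earlier).
    def l_max(n, m):
        return 0.5 * math.sqrt(math.log2(n / m))

    print(l_max(10**6, 100))   # ~1.82
    print(l_max(10**9, 100))   # ~2.41: even a thousandfold larger DB barely helps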
• Beyond that dimensionality, every block must be visited:
  $P_{visit}[i] = Vol(MSum(mbr_i, E[nn_{dist}]) \cap \Omega) = 1$
• Fig. 7: Comparison of $l_{max}$ with $E[nn_{dist}]$
Data-Partitioning Methods
• Data-partitioning methods partition the data space hierarchically, in order to reduce the search cost from $O(N)$ to $O(\log N)$
• In practice, existing methods are unsuited to NN-search in HDVSs: a simple sequential scan out-performs these more sophisticated hierarchical methods
Presenter: Rina You
Rectangular MBRs
• Index methods of this group use hyper-rectangles to bound the region of a block
• Splitting a node results in two new, equally-full partitions of the data space
• At high dimensionality, only $d' = \log_2 \frac{N}{m}$ dimensions are split
• The rectangular MBR then has
  - d' sides with a length of 1/2
  - d - d' sides with a length of 1
• The probability of visiting a block during NN-search: the volume of that part of the extended box lying within the data space
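A Monte Carlo sketch of this probability for such a block (anchoring the block at the origin is our simplifying assumption; p_visit_rect is our own name):

    import random

    # P_visit for an MBR with d' sides of length 1/2 and d - d' sides of
    # length 1, extended by radius r; queries are sampled uniformly in [0,1]^d.
    def p_visit_rect(d, d_split, r, samples=50_000):
        hits = 0
        for _ in range(samples):
            q = [random.random() for _ in range(d)]
            # only the split dimensions can lie outside the box
            dist2 = sum(max(qi - 0.5, 0.0) ** 2 for qi in q[:d_split])
            hits += dist2 <= r * r
        return hits / samples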
• Figure: the probability of accessing a block during NN-search, for different database sizes and different values of d'
Spherical MBRs
• Another group of index structures uses MBRs in the form of hyper-spheres
• Each block of an optimal such structure consists of
  - a center point C
  - the m - 1 nearest neighbors of C
• The MBR can therefore be described by the sphere $sp(C, nn_{dist,m-1}(C))$
• The Minkowski sum of a spherical MBR is again a sphere, enlarged by the NN-distance:
  $MSum(sp(C, nn_{dist,m-1}(C)), E[nn_{dist}]) = sp^d(C, nn_{dist,m-1}(C) + E[nn_{dist}])$
• The probability that block i must be visited during a NN-search:
  $P^{sp}_{visit}[i] = Vol(sp^d(C_i, nn_{dist,m-1}(C_i) + E[nn_{dist}]) \cap \Omega)$
• A lower bound for this probability is obtained by replacing $nn_{dist,m-1}$ by $nn_{dist,1}$:
  - $nn_{dist,i}$ does not decrease as i increases: $i \le j \Rightarrow nn_{dist,i} \le nn_{dist,j}$
  - and $E[nn_{dist,1}] = E[nn_{dist}]$
• Hence:
  $P^{sp}_{visit}[i] \ge Vol(sp^d(C_i, 2 \cdot E[nn_{dist}]) \cap \Omega)$
• The average probability of accessing a block during the search: average the above bound over all center points C:
  $P^{sp,avg}_{visit} \ge \int_{\Omega} Vol(sp^d(C, 2 \cdot E[nn_{dist}]) \cap \Omega) \, dC$
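This bound can likewise be estimated numerically (a sketch; both the centers C and the clipped sphere volumes are sampled uniformly in the unit cube, and r stands for $2 \cdot E[nn_{dist}]$):

    import math, random

    # Average of Vol(sp(C, r) ∩ Ω) over uniform centers C.
    def p_visit_sphere_bound(d, r, centers=200, samples=2000):
        total = 0.0
        for _ in range(centers):
            c = [random.random() for _ in range(d)]
            inside = sum(math.dist(c, [random.random() for _ in range(d)]) <= r
                         for _ in range(samples))
            total += inside / samples
        return total / centers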
• The percentage of blocks visited increases rapidly with the dimensionality
• A sequential scan will therefore perform better in practice
General Partitioning and Clustering Schemes
• No partitioning or clustering scheme can offer efficient NN-search if the number of dimensions becomes large
• The complexity of such methods is $O(N)$
• A large portion (up to 100%) of the data blocks must be read in order to determine the nearest neighbor
• Basic assumptions:
1. A cluster is a geometrical form (MBR) that covers all cluster points
2. Each cluster contains at least two points
3. The MBR of a cluster is convex.
• Average probability of accessing a cluster during an NN-search (over l clusters):
  $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i))$, where $VM(x) = Vol(MSum(x, E[nn_{dist}]) \cap \Omega)$
• Lower-bound the average probability of accessing a cluster by that of a line cluster:
  $VM(mbr(C_i)) \ge VM(line(A_i, B_i))$
• Pick two arbitrary data points $A_i$ and $B_i$: each cluster contains at least two points
• $line(A_i, B_i)$ is contained in $mbr(C_i)$: $mbr(C_i)$ is convex
• This lower-bounds the volume of the extended $mbr(C_i)$
• Lower-bound the distance between $A_i$ and $B_i$:
  $VM(line(A_i, B_i)) \ge VM(line(A_i, P_i)) = \min_{Q \in surf(nn^{sp}(A_i))} VM(line(A_i, Q))$, with $P_i \in surf(nn^{sp}(A_i))$
  - points on the surface of the NN-sphere of $A_i$ have minimal Minkowski sum for $line(A_i, \cdot)$
  - $line(A_i, P_i)$ is thus the optimal line cluster for point $A_i$, where $P_i$ is a point on the surface of the NN-sphere of $A_i$
• Lower-bound the average probability of accessing the line clusters: calculate the average volume of the Minkowski sums over all possible pairs A and P(A) in the data space:
  $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i)) \ge \int_{\Omega} VM(line(A, P(A))) \, dA$
• Conclusion 1 (Performance): for any clustering and partitioning method, a simple sequential scan performs better if the number of dimensions exceeds some d
• Conclusion 2 (Complexity): the complexity of any clustering and partitioning method tends towards O(N) as dimensionality increases
• Conclusion 3 (Degeneration): all blocks are accessed if the number of dimensions exceeds some d
The VA-file
• Accelerates the unavoidable sequential scan by using object approximations to compress the vector data
• Reduces the amount of data that must be read during similarity searches.
• Compressing vector data
• The filtering step
• Accessing the data
Presenter: Kilho Lee
The VA-file: Compressing vector data

• For each dimension i, a small number of bits $b_i$ is assigned
• Let b be the sum of all $b_i$'s: $b = \sum_{i=1}^{d} b_i$
• The data space is divided into $2^b$ cells
• The probability that a point lies in a given cell:
  $P["in\_cell"] = Vol(cell) = \prod_{i=1}^{d} \left(\frac{1}{2}\right)^{b_i} = 2^{-b}$
• The probability that a cell is shared by more than one of the N points:
  $P[Share^b_N] = 1 - (1 - 2^{-b})^{N-1}$
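The approximation itself is plain uniform quantization; a minimal sketch (function and variable names are ours, not the paper's implementation):

    # Quantize each coordinate of a vector in [0,1]^d to b_i bits,
    # yielding the grid-cell index that serves as the approximation.
    def approximate(vector, bits_per_dim):
        return [min(int(x * 2**b), 2**b - 1)
                for x, b in zip(vector, bits_per_dim)]

    print(approximate([0.12, 0.97, 0.50], bits_per_dim=[6, 6, 6]))   # [7, 62, 32]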
The VA-file: Filtering step

• When searching for the nearest neighbor, the entire approximation file is scanned, and upper and lower bounds on the distance to the query are derived from each approximation
• Let δ be the smallest upper bound found so far; if an approximation's lower bound exceeds δ, the corresponding vector is filtered out
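The filtering loop can be sketched as follows (lower_bound and upper_bound stand in for the per-dimension bound computations of a real VA-file; the structure of the scan is what matters):

    # Scan all approximations, keeping only those whose lower bound does not
    # exceed delta, the smallest upper bound seen so far.
    def filter_step(approximations, lower_bound, upper_bound):
        delta = float("inf")
        candidates = []
        for i, a in enumerate(approximations):
            lb = lower_bound(a)
            if lb <= delta:                       # cannot be pruned yet
                candidates.append((lb, i))
                delta = min(delta, upper_bound(a))
        return candidates, delta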
• After the filtering step, less than 0.1% of the vectors remain as candidates
The VA-file: Accessing the vectors

• After the filtering step, a small set of candidates remains
• The candidates are visited in increasing order of their lower bounds, and their exact vectors are read
• If a lower bound is encountered that exceeds the smallest exact distance seen so far, the VA-file method stops
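A sketch of this access phase (candidates are the (lower bound, id) pairs from the filtering step; math.dist stands in for the exact distance computed after reading the vector from disk):

    import math

    # Visit candidates in ascending lower-bound order; stop as soon as a
    # lower bound exceeds the smallest exact distance found so far.
    def access_step(candidates, vectors, query):
        best_dist, best_id = float("inf"), None
        for lb, i in sorted(candidates):
            if lb > best_dist:        # no remaining candidate can be closer
                break
            d = math.dist(query, vectors[i])
            if d < best_dist:
                best_dist, best_id = d, i
        return best_id, best_dist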
• In practice, less than 1% of the vector blocks are visited
• For d = 50, $b_i$ = 6 and N = 500,000, only 20 vectors are accessed
Performance
• The figure depicts the percentage of blocks visited.
Conclusion
• Conventional indexing methods are out-performed by a simple sequential scan already at moderate dimensionality (d ≈ 10)
• At moderate and high dimensionality (d ≥ 6), the VA-file method can out-perform any other method