1
Query Sampling Based High Dimensional Hybrid Index
Junqi Zhang, Xiangdong Zhou
Fudan University
2
Nearest Neighbors Query
[Figure: a query Q and its query cover area over clusters O1, O2 and O3, with radii r1 and r2 and a point P1. Panel labels: Dims, Overlap, Accessed.]
3
Cluster Partitioning Based B+-tree
[Figure: cluster splitting. Cluster i (center Oi; radii labeled r and rc) is split into core sub-cluster i and marginal sub-cluster i.]
4
Index Structure
[Figure: leaf nodes of the B+-tree. The index keys C x 1, C x 2, ..., C x i, C x (i+1), where C is a hash factor, delimit the core and marginal sub-clusters of clusters 1 to i; a query Q and its query cover area are overlaid.]
What is the optimal extent of partitioning?
iDistance: determined by experiments
Ours: predicted by a cost model
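A minimal sketch of how such an index key could be computed, assuming the iDistance-style mapping key = C x i + dist(p, Oi), with Oi the nearest cluster center. The function name `index_key`, the sample centers and the value of C are illustrative assumptions, not taken from the slides.

```python
import math

def index_key(point, centers, c):
    """Map a point to a one-dimensional B+-tree key.

    Assumption (iDistance-style): key = c * i + dist(point, O_i), where O_i
    is the nearest cluster center and i is its 1-based number.
    """
    dists = [math.dist(point, o) for o in centers]
    i = min(range(len(centers)), key=dists.__getitem__)
    return c * (i + 1) + dists[i]

centers = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical cluster centers O_1, O_2
c = 100.0                             # hash factor, chosen larger than any cluster radius
points = [(3.0, 4.0), (10.0, 11.0), (0.0, 1.0)]
keys = sorted(index_key(p, centers, c) for p in points)
print(keys)
```

Because C exceeds every cluster radius, the key ranges of different clusters do not overlap, so one B+-tree can hold all clusters side by side.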
5
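Objective of Cluster Partitioning: Lowest Query Cost
Appropriate M: [formula garbled in extraction; legible fragments: arg min, Nodes, KNN, Q, M, N, u, H, c, 2]
Distribute M to each cluster
Overall number of clusters: [formula garbled in extraction; legible symbols: N, u, H, opt, 2]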
6
Curse of Dimensionality
dim > 10: tree < scan < VA-file
dim < 10: tree > scan > VA-file
Non-uniform: tree VA-file
VA-file defect
How to improve tree performance?
7
Tree and scan: which is better?
Tree
advantage: filters the data instead of linearly scanning the whole file
disadvantage: the positioning cost for each data item is the height of the intermediate nodes, which is higher than for a scan
Scan
advantage: the positioning cost for each data item is 0
disadvantage: linearly scans the whole file
8
Cost viewed from each point
$C = \frac{\text{average cost on tree}}{\text{average cost by linear scan}}$
(C < 1): tree is useful compared with scan
(C >= 1): tree is useless compared with scan
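The ratio test can be written out in a few lines. This is only a sketch: the names `cost_ratio` and `tree_is_useful` are hypothetical, and the costs are assumed to be measured in the same unit (for example, page accesses per retrieved point).

```python
def cost_ratio(avg_cost_tree, avg_cost_scan):
    """C = average cost on the tree / average cost by linear scan."""
    return avg_cost_tree / avg_cost_scan

def tree_is_useful(avg_cost_tree, avg_cost_scan):
    """The tree pays off only when C < 1."""
    return cost_ratio(avg_cost_tree, avg_cost_scan) < 1

# Hypothetical measurements: 4 page accesses via the tree, 10 via the scan.
print(cost_ratio(4.0, 10.0), tree_is_useful(4.0, 10.0))
```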
9
Data distribution and index performance
Known work: index all data in a single index (DIMS, tree)
Real image data set: non-uniform
Non-uniform data aggregates, so the tree is fast
10
Data type
Sparse data: tree < scan
Dense data: tree > scan
11
Hybrid data type: hybrid index
Sparse data (tree < scan): sequential file
Dense data (tree > scan): B+-tree
12
How to differentiate data type?
Each data item as a unit: difficult
Each cluster ring as a unit: easier
13
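Cluster partitioning
[Figure: cluster split. Clusters O1, O2 and O3 are each divided into an inner circle, a middle circle and an outer circle by radii r, r2 and r3; queries Q are overlaid.]
What extent? [formula garbled in extraction; legible fragments: arg min, Nodes, KNN, Q, M, N, u, H, c, 2]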
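To make the ring idea concrete, here is a small sketch that assigns a point to the inner, middle or outer circle of its cluster from its distance to the center. The function name `ring_index` and the sample radii are assumptions for illustration; how the radii are chosen (the cost model) is not reproduced here.

```python
import bisect
import math

def ring_index(point, center, radii):
    """Return the ring a point falls into.

    radii holds the ascending ring boundaries, e.g. [r1, r2, r3]:
    0 = inner circle, 1 = middle circle, 2 = outer circle,
    len(radii) = outside the cluster.
    """
    return bisect.bisect_left(radii, math.dist(point, center))

center = (0.0, 0.0)
radii = [1.0, 2.0, 3.0]  # hypothetical boundaries
rings = [ring_index(p, center, radii) for p in [(0.0, 0.5), (1.5, 0.0), (0.0, 2.5)]]
print(rings)
```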
14
Cluster partitioning based B+-tree
[Figure: leaf nodes of the B+-tree with index keys c x 1, c x 2, c x 3, ..., c x i, c x (i+1), where c is a hash factor, holding rings O11, O12, O13, ..., Oi1, Oi2; a separate marginal data file; a query Q and its query cover area.]
15
Cluster partitioning based image retrieval system
Outer rings of clusters are often accessed
16
Some rings of clusters are often accessed
Treat outer rings as sparse rings?
17
Access frequency of each ring
18
19
Hybrid index: cut branches (according to the contribution of each ring to the query cost)
Expected cost: $\left(H + \frac{N_{c_i}}{u}\right) P(c_i)$
Cost by linear scan: $\frac{N_{c_i}}{b}$
[Figure: leaf nodes of the B+-tree with index keys C x 1, C x 2, C x 3, ..., C x i, C x (i+1), where C is a hash factor, holding rings O11, O12, O13, ..., Oi1, Oi2; a separate marginal data file; a query Q and its query cover area.]
20
Standard for cutting rings: Index Capability
IC (index capability): $IC_i = \frac{N_{c_i}}{b} - \left(H + \frac{N_{c_i}}{u}\right) P(c_i)$
Question: how to determine $P(c_i)$?
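A small worked sketch of the index capability, assuming it has the form IC_i = N_ci / b - (H + N_ci / u) * P(c_i), that is, the linear-scan cost of ring c_i minus its expected cost on the tree. The slides do not define the symbols in the extracted text; H is presumably the tree height, and u and b per-page capacities of the tree and of the sequential file. The function names and the numbers are illustrative.

```python
def expected_tree_cost(n_ci, p_ci, H, u):
    """Expected cost of keeping ring c_i in the tree: (H + N_ci / u) * P(c_i)."""
    return (H + n_ci / u) * p_ci

def scan_cost(n_ci, b):
    """Cost of ring c_i in the sequential file: N_ci / b."""
    return n_ci / b

def index_capability(n_ci, p_ci, H, u, b):
    """IC_i > 0 means the tree is cheaper than the scan for this ring."""
    return scan_cost(n_ci, b) - expected_tree_cost(n_ci, p_ci, H, u)

# Hypothetical ring: 1000 points, H = 3, u = 50, b = 100.
print(index_capability(1000, 0.2, H=3, u=50, b=100))  # rarely accessed ring
print(index_capability(1000, 0.8, H=3, u=50, b=100))  # frequently accessed ring
```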
21
Estimate $P(c_i)$ by query sampling
$P(c_i) = \frac{\text{number of times ring } c_i \text{ is accessed}}{\text{number of queries } Q}$
Question: for a large database, a large number of queries is expensive
Objective: given confidence a%, minimize Q
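The estimate is a plain relative frequency. The sketch below assumes a hypothetical query log in which each entry is the set of rings one sampled query accessed; the name `estimate_access_probability` and the log itself are illustrative.

```python
def estimate_access_probability(accessed_rings_per_query, ring):
    """P(c_i) ~= (queries that accessed ring c_i) / (number of sampled queries)."""
    hits = sum(1 for rings in accessed_rings_per_query if ring in rings)
    return hits / len(accessed_rings_per_query)

# Hypothetical log of 4 sampled queries and the rings each one touched.
query_log = [{"O11", "O12"}, {"O12"}, {"O12", "O13"}, {"O11"}]
p_o12 = estimate_access_probability(query_log, "O12")
print(p_o12)
```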
22
Threshold for cutting rings
When IC equals 0:
$IC_i = \frac{N_{c_i}}{b} - \left(H + \frac{N_{c_i}}{u}\right) P(c_i) = 0$
$P(c_i) = \frac{N_{c_i}/b}{H + N_{c_i}/u} = \frac{u N_{c_i}}{buH + bN_{c_i}}$
Rule: when the probability of a ring being accessed by queries is lower than this threshold, the ring should remain in the tree; otherwise it should be cut into the sequential file for linear scan.
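The rule can be applied ring by ring. This sketch assumes the threshold P(c_i) = u * N_ci / (b * u * H + b * N_ci) obtained by setting IC to 0; the names `access_threshold` and `split_rings`, and the ring sizes and probabilities, are made up for illustration.

```python
def access_threshold(n_ci, H, u, b):
    """Access probability at which IC_i = 0."""
    return (u * n_ci) / (b * u * H + b * n_ci)

def split_rings(rings, H, u, b):
    """Partition rings into (kept in tree, cut to sequential file).

    rings maps a ring name to (N_ci, P(c_i)).
    """
    tree, seq_file = [], []
    for name, (n_ci, p_ci) in rings.items():
        if p_ci < access_threshold(n_ci, H, u, b):
            tree.append(name)        # rarely accessed: the tree is cheaper
        else:
            seq_file.append(name)    # often accessed: linear scan is cheaper
    return tree, seq_file

rings = {"O11": (1000, 0.2), "O12": (1000, 0.8), "O13": (200, 0.5)}
tree_rings, scanned_rings = split_rings(rings, H=3, u=50, b=100)
print(tree_rings, scanned_rings)
```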
23
Query sampling algorithm:
When
$P(c_i) - t_{a/2}(n-1)\frac{S_i}{\sqrt{n}} > \frac{u N_{c_i}}{buH + bN_{c_i}}$
or
$P(c_i) + t_{a/2}(n-1)\frac{S_i}{\sqrt{n}} < \frac{u N_{c_i}}{buH + bN_{c_i}}$
or
$error_i \to 0$,
stop sampling.
The user can balance the accuracy and efficiency of sampling by tuning the confidence a%, and the complexity of this algorithm is less than N.
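A sketch of a sequential sampling loop with such a stopping rule: sampling stops once the confidence interval around the estimated access probability lies entirely on one side of the threshold, or once the interval becomes negligibly narrow. Two simplifications are assumptions of this sketch and not part of the slides: the Student t quantile is replaced by the normal quantile from the standard library (a common approximation for n of about 30 or more), and S_i is taken as the standard deviation of the 0/1 access indicator. The names `sample_access_probability`, `min_n`, `max_n` and `eps` are hypothetical.

```python
import math
import random
from statistics import NormalDist

def sample_access_probability(query_hits_ring, threshold, confidence=0.95,
                              min_n=30, max_n=10_000, eps=1e-3):
    """Sample queries until the ring can be classified against the threshold.

    query_hits_ring: callable that runs one sampled query and returns True
    if the ring was accessed. Returns (estimated P(c_i), queries used).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # normal stand-in for t_{a/2}(n-1)
    hits = 0
    for n in range(1, max_n + 1):
        hits += bool(query_hits_ring())
        if n < min_n:
            continue
        p = hits / n
        s = math.sqrt(p * (1 - p))       # std. dev. of the 0/1 access indicator
        half = z * s / math.sqrt(n)      # half-width of the confidence interval
        if p - half > threshold or p + half < threshold or half < eps:
            return p, n
    return hits / max_n, max_n

rng = random.Random(0)  # simulated ring that is accessed by about 90% of queries
p_hat, n_used = sample_access_probability(lambda: rng.random() < 0.9, threshold=0.43)
print(p_hat, n_used)
```

Raising `confidence` widens the interval and therefore costs more sampled queries before a ring is classified, which is the accuracy versus efficiency trade-off mentioned above.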
24
Query algorithm of hybrid index
Linearly scan the sequential file for the sparse data
Retrieve the dense data from the B+-tree
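The two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a sorted Python list stands in for the B+-tree leaf level, keys follow the iDistance-style mapping c * i + dist(p, O_i), the caller decides which points are dense, and a range query with radius r is shown instead of a full k-nearest-neighbor search (which could repeat the range query with a growing r). The class name `HybridIndex` and all data are hypothetical.

```python
import bisect
import math

class HybridIndex:
    def __init__(self, centers, c):
        self.centers = centers  # cluster centers O_1..O_m
        self.c = c              # hash factor, larger than any cluster radius
        self.tree = []          # sorted (key, point) pairs; stand-in for B+-tree leaves
        self.seq_file = []      # sparse points, always scanned linearly

    def insert(self, point, dense):
        if not dense:
            self.seq_file.append(point)
            return
        dists = [math.dist(point, o) for o in self.centers]
        i = min(range(len(dists)), key=dists.__getitem__)
        bisect.insort(self.tree, (self.c * (i + 1) + dists[i], point))

    def range_query(self, q, r):
        # Step 1: linear scan of the sequential file (sparse data).
        result = [p for p in self.seq_file if math.dist(q, p) <= r]
        # Step 2: one key-range lookup per cluster (dense data).
        for i, o in enumerate(self.centers):
            dq = math.dist(q, o)
            lo = self.c * (i + 1) + max(dq - r, 0.0)
            hi = self.c * (i + 1) + dq + r
            next_cluster = self.c * (i + 2)  # never read into the next cluster's keys
            start = bisect.bisect_left(self.tree, (lo,))
            for key, p in self.tree[start:]:
                if key > hi or key >= next_cluster:
                    break
                if math.dist(q, p) <= r:     # drop false positives of the key filter
                    result.append(p)
        return result

index = HybridIndex(centers=[(0.0, 0.0), (10.0, 10.0)], c=100.0)
for point, dense in [((1.0, 1.0), True), ((9.0, 9.0), True), ((2.0, 0.0), True),
                     ((5.0, 5.0), False), ((50.0, 50.0), False)]:
    index.insert(point, dense)
found = sorted(index.range_query((1.0, 0.0), 1.5))
print(found)
```

The key-range filter relies on the triangle inequality: a point within r of the query must lie at a distance between dist(q, O_i) - r and dist(q, O_i) + r from its cluster center.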
25
Thanks!