Introduction to The NSP-Tree: A Space-Partitioning Based Indexing Method


Gang Qian

University of Central Oklahoma

November 2006

Summary

  • Overview
  • Motivation and Existing Work
  • NSP-Tree Structure, Algorithms and Performance
  • Conclusion and Future Work

Overview

  • The NSP-tree is a disk-based index structure
      • Similar to the B-tree/B+-tree
  • It is designed to index a large number of vectors with non-ordered discrete components
      • Domains with discrete values that are not naturally ordered are very common
      • E.g., gender, profession, genome bases, etc.
  • It is used to speed up similarity queries over the indexed data
      • Unlike an exact query, a similarity query searches for data items that are similar to the given query data item

Motivation

  • Traditional database technology is mature
      • Data model: Relational
      • Data model design: ER/EER diagrams
      • Query: SQL
      • Data integrity: Transaction processing
      • Index: B-tree/B+-tree
  • Some hard unsolved issues still exist
      • E.g., multidimensional query optimization

  • New problems arise with the increasing demand for the management of non-traditional data types
      • Multimedia data
      • Scientific data
      • Spatial data
      • Temporal data
      • Biological data, etc.
  • With the new data types, exact queries are no longer useful
      • Similarity queries become more and more important

Vector Model

  • The vector model is a very useful tool for supporting these new data types
      • Many non-traditional data types are vectors or can be easily converted into vectors
      • E.g., feature vectors for images
  • Vectors can be viewed as points in a high-dimensional data space
  • Therefore, the distance between a pair of vectors is a natural quantitative measure of the (dis)similarity between the two data objects that the vectors represent
      • E.g., Euclidean distance

  • The problem of managing non-traditional databases becomes the problem of managing vector databases
  • Designing index structures to support efficient similarity queries on vectors is an open research area in vector databases
      • For example, the NSP-tree is designed to index vectors with discrete and non-ordered components
      • E.g., genome sequence data

Existing Work

  • A number of index structures have been proposed for vectors with continuous numerical components
      • E.g., the R-tree and its variants: SS-tree, SR-tree, X-tree, Hybrid tree, etc.
  • Due to the volume of the data, almost all proposed index structures are disk-based

  • The basic structure of these indices is very similar to that of the B+-tree
      • Hierarchical tree structure
      • Each tree node occupies exactly one disk block and has a minimum utilization requirement
      • Vectors are stored in leaf nodes
      • Non-leaf nodes contain routing information that is used for tree construction and searching
      • Routing information is usually represented by a certain type of minimum bounding shape
          • Minimum Bounding Rectangle (MBR), Minimum Bounding Sphere (MBS), etc.


Example: R-Tree Structure

Figure adopted from “The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries” (SIGMOD 1997).

  • Such an index tree grows in a bottom-up fashion
      • Vectors are incrementally inserted into the tree
      • When a leaf node is full, it is split into two leaves
      • The split of a child in the tree may cause the split of its parent
      • Node splits may propagate all the way up to the root, in which case the root itself is split and a new root is created
  • Search works top-down from the root
  • Search performance is usually measured in terms of the total number of disk blocks/nodes accessed
  • Search efficiency is derived from pruning branches that are not within the search range
      • Unlike in a brute-force linear search, vectors in irrelevant branches are not visited
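
The top-down search with pruning described above can be sketched as follows. This is only a toy illustration for continuous vectors with MBRs as routing information, not the code of any particular R-tree variant. The names `Node`, `min_dist`, `range_search` and the `stats` counter are assumptions made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    mbr: list                                      # [(low, high), ...], one pair per dimension
    children: list = field(default_factory=list)   # non-leaf node: child nodes
    vectors: list = field(default_factory=list)    # leaf node: stored vectors


def min_dist(query, mbr):
    """Smallest Euclidean distance from `query` to any point inside `mbr`."""
    return math.sqrt(sum(max(lo - q, 0.0, q - hi) ** 2
                         for q, (lo, hi) in zip(query, mbr)))


def range_search(node, query, radius, stats):
    """Return all vectors within `radius` of `query`; count the nodes accessed."""
    stats["nodes"] += 1
    if not node.children:                          # leaf: check the stored vectors
        return [v for v in node.vectors if math.dist(query, v) <= radius]
    hits = []
    for child in node.children:
        # Prune every branch whose bounding rectangle is out of the search range.
        if min_dist(query, child.mbr) <= radius:
            hits += range_search(child, query, radius, stats)
    return hits


# Hypothetical two-leaf tree over the unit square.
leaf1 = Node(mbr=[(0.0, 0.4), (0.0, 0.4)], vectors=[(0.1, 0.1), (0.3, 0.4)])
leaf2 = Node(mbr=[(0.6, 1.0), (0.6, 1.0)], vectors=[(0.7, 0.9), (1.0, 0.6)])
root = Node(mbr=[(0.0, 1.0), (0.0, 1.0)], children=[leaf1, leaf2])

stats = {"nodes": 0}
print(range_search(root, (0.2, 0.2), 0.15, stats))  # only leaf1 is searched
print(stats["nodes"])                               # root + leaf1; leaf2 is pruned
```

A linear scan would touch every stored vector. Here the second leaf is never read because its bounding rectangle lies outside the query range.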

  • Unfortunately, the index trees mentioned in the previous slides cannot be directly used for vectors with non-ordered discrete components
  • The ND-tree was proposed to index such vectors
      • See “The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces” (VLDB 2003)

Discrete Space Concepts

  • The structure of the ND-tree is very similar to those of the R-tree variants
  • However, all the underlying geometrical concepts are redefined to accommodate discrete vectors

Euclidean/Continuous Space    Discrete Space
Vector                        Discrete Vector
Rectangle                     Discrete Rectangle
Area                          Discrete Area
Euclidean Distance            Hamming Distance
…                             …
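
As a minimal sketch of the distance row of this comparison, the Hamming distance between two discrete vectors counts the dimensions on which they differ. The function name `hamming_distance` is an assumption made for this example.

```python
def hamming_distance(u, v):
    """Number of dimensions on which two equal-length discrete vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(1 for a, b in zip(u, v) if a != b)


# Genome-like vectors over the non-ordered alphabet {a, c, g, t}.
print(hamming_distance("acgt", "aggt"))   # the vectors differ in one position
print(hamming_distance("acgt", "tgca"))   # the vectors differ in every position
```

No ordering of the values is needed: only equality per dimension is tested, which is why the measure suits non-ordered domains.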

Example: Discrete Rectangles

  • Introduced to bound vectors with non-ordered discrete components
  • A normal rectangle can be viewed as the Cartesian product of ranges, one for every dimension in the data space
      • E.g., [0.1, 0.2] × [0.7, 0.8] is a two-dimensional rectangle
  • A discrete rectangle is defined as the Cartesian product of sets of discrete values, one from every dimension
      • E.g., {a, g} × {t, c, g} is a two-dimensional discrete rectangle that covers vectors such as <a, c>, <g, t> and <g, g>
  • Discrete Minimum Bounding Rectangles (DMBRs) store the routing information for the ND-tree
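
The definition above maps directly onto a list of sets, one set per dimension. The sketch below uses that representation; the helper names `contains` and `dmbr` are assumptions for this example, not names from the ND-tree paper.

```python
def contains(rect, vector):
    """True if every component of `vector` lies in the value set of its dimension."""
    return len(rect) == len(vector) and all(
        value in allowed for value, allowed in zip(vector, rect))


def dmbr(vectors):
    """Smallest discrete rectangle covering `vectors`: the values seen on each dimension."""
    return [set(column) for column in zip(*vectors)]


rect = [{"a", "g"}, {"t", "c", "g"}]            # the rectangle {a, g} x {t, c, g}
covered = [("a", "c"), ("g", "t"), ("g", "g")]

print(all(contains(rect, v) for v in covered))  # True: all three vectors are covered
print(contains(rect, ("t", "c")))               # False: t is not in {a, g}
print(dmbr(covered) == rect)                    # True: rect is also their minimum bounding rectangle
```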

Problem of The ND-tree

  • Overlap in an index tree may dramatically degrade its search performance
  • The construction of the ND-tree cannot totally avoid overlap among the DMBRs in the tree
      • The ND-tree works well when the data is randomly distributed
      • However, for certain data sets, overlap cannot be avoided
          • For example, skewed data sets based on the Zipf distribution
  • To guarantee the minimum disk utilization, the split algorithm may NOT be able to find an overlap-free split for an overflow node
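
To make "overlap among DMBRs" concrete, here is a small sketch. It assumes that the discrete area of a rectangle is the product of its per-dimension set sizes, and that the overlap of two rectangles is the area of their per-dimension intersection. The slides only name these concepts, so both definitions and the function names are assumptions for this example.

```python
from math import prod


def intersection(r1, r2):
    """Per-dimension intersection of two discrete rectangles."""
    return [s1 & s2 for s1, s2 in zip(r1, r2)]


def area(rect):
    """Number of distinct vectors covered by a discrete rectangle."""
    return prod(len(values) for values in rect)


def overlap(r1, r2):
    """Number of vectors covered by both rectangles; 0 means overlap-free."""
    return area(intersection(r1, r2))


r1 = [{"a", "g"}, {"t", "c"}]
r2 = [{"g", "t"}, {"c", "g"}]
r3 = [{"c", "t"}, {"a", "t"}]

print(overlap(r1, r2))  # 1: both rectangles cover <g, c>
print(overlap(r1, r3))  # 0: the first dimensions are disjoint
```

A range query that reaches the shared region of `r1` and `r2` has to follow both branches, which is why overlap costs extra node accesses.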

Basic Idea of The NSP-Tree

  • There are three factors that affect search performance
      • Disk utilization
      • Overlap
      • Fan-out
          • Maximum number of children of a tree node
  • Since overlap cannot be totally avoided when there is a minimum disk utilization requirement, the design of the NSP-tree drops that requirement so that an overlap-free structure can be guaranteed

Space-Partitioning Indexing Methods

  • The idea of an overlap-free index structure is not new
      • What makes the NSP-tree new is that it can handle non-ordered discrete data based on an overlap-free structure
  • There is a category of index trees that have such a feature
      • KDB-tree
      • hB-tree
      • LSD-tree, etc.
  • They are called space-partitioning indexing methods
      • R-tree variants are called data-partitioning indexing methods
  • All previous space-partitioning indices support only vectors with continuous numeric components

[Figure: "Partitioned Data Space", a two-dimensional data space over [0, 1] × [0, 1] divided into regions, shown next to "Space-partitioning Information", the binary tree of splits that produces the regions. Each tree node records d, the split dimension, and v, the split point on the split dimension, with a "<=" branch and a ">" branch. Splits shown: (d: 1, v: 0.6), (d: 2, v: 0.3), (d: 2, v: 0.6), (d: 1, v: 0.2), (d: 1, v: 0.4), (d: 2, v: 0.2), (d: 1, v: 0.75).]
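
A sketch of how such split information routes a point to its region in a continuous data space. The tree below is a small hypothetical example, not a reconstruction of the figure. The names `Split` and `locate` and the region labels "A" to "D" are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Split:
    d: int                        # split dimension (1-based, as in the legend)
    v: float                      # split point on the split dimension
    left: Union["Split", str]     # subtree or region label for x[d] <= v
    right: Union["Split", str]    # subtree or region label for x[d] > v


def locate(node, point):
    """Follow the '<=' / '>' branches until a region label is reached."""
    while isinstance(node, Split):
        node = node.left if point[node.d - 1] <= node.v else node.right
    return node


# Hypothetical partition of the unit square into four regions.
tree = Split(1, 0.6,
             Split(2, 0.3, "A", "B"),
             Split(2, 0.6, "C", "D"))

print(locate(tree, (0.5, 0.2)))  # A: x1 <= 0.6 and x2 <= 0.3
print(locate(tree, (0.7, 0.9)))  # D: x1 > 0.6 and x2 > 0.6
```

Every point follows exactly one path, so the regions cannot overlap. The comparison `<=` relies on the values being ordered, which is what fails for non-ordered discrete data.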

Space-Partitioning vs. Data-Partitioning

                                   Space-Partitioning    Data-Partitioning
Objects that can be indexed        Vectors only          Vectors and spatial objects
Minimum Utilization Requirement    No                    Yes
Guaranteed Overlap-free            Yes                   No
Fan-out                            Large                 Small

NSP-Tree Structure

  • Similar to those of the B+-tree and the R-tree, but with no minimum disk utilization requirement
      • Each node occupies one disk block
      • Vectors are stored in leaf nodes
      • Space-partitioning information is stored in non-leaf nodes
  • The space concept in the NSP-tree is discrete
      • A discrete data space is defined as the Cartesian product of the sets of all possible values on every dimension
      • Due to the non-ordered nature of the values, a split point on a split dimension is no longer enough to describe a split
          • Need to explicitly record how the values on a dimension are separated into two groups
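
One way to picture the last point: instead of a (dimension, split point) pair, a split on a non-ordered dimension has to store the two groups of values themselves. The record below is an illustrative assumption, not the on-disk layout used by the NSP-tree. The names `DiscreteSplit` and `side` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscreteSplit:
    dim: int            # split dimension (0-based in this sketch)
    left: frozenset     # values on `dim` that go to the first group
    right: frozenset    # values on `dim` that go to the second group

    def __post_init__(self):
        if self.left & self.right:
            raise ValueError("the two value groups must be disjoint")

    def side(self, vector):
        """Report which group the vector's value on the split dimension belongs to."""
        value = vector[self.dim]
        if value in self.left:
            return "left"
        if value in self.right:
            return "right"
        raise ValueError(f"value {value!r} is not covered by this split")


# Split the genome alphabet on dimension 0 into {a, g} and {c, t}.
split = DiscreteSplit(0, frozenset("ag"), frozenset("ct"))
print(split.side(("g", "t")))  # left
print(split.side(("c", "a")))  # right
```

Because the two groups are disjoint, the two sides of the split cannot overlap, which mirrors the role of "<=" and ">" around a split point in the continuous case.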


Structure of The NSP-Tree


Routing Information: Split History Tree (SPT)

  • Conceptually, each node corresponds to a subspace of the discrete data space
      • A subspace is defined as the Cartesian product of subsets of the values on every dimension
      • There is no overlap among the subspaces of the children on the same level
      • The subspace of a parent node contains the subspaces of all its children

Eliminating Dead Space

  • One disadvantage of a pure space-partitioning approach is that the subspaces do not necessarily minimally bound the vectors in the space
      • See next slide
  • To further improve the pruning power, DMBRs are used as additional routing information in the tree
  • However, the use of DMBRs reduces the fan-out of the tree
      • More space in a node is needed to store the DMBRs
      • We found that the benefits of using DMBRs usually outweigh the disadvantage of the reduced fan-out
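
A sketch of why a DMBR can prune where the subspace alone cannot, assuming range queries under the Hamming distance. The smallest Hamming distance from a query to any vector in a discrete rectangle is the number of dimensions on which the query value is missing from the rectangle's set. The helper names and the data are assumptions for this example.

```python
def min_hamming_distance(query, rect):
    """Smallest Hamming distance from `query` to any vector inside `rect`."""
    return sum(1 for value, allowed in zip(query, rect) if value not in allowed)


def can_prune(query, radius, rect):
    """A node can be skipped if even its closest possible vector is out of range."""
    return min_hamming_distance(query, rect) > radius


# Subspace assigned to a node by space partitioning (hypothetical).
subspace = [{"a", "c", "g", "t"}, {"a", "c"}, {"a", "c", "g", "t"}]
# Vectors actually stored under the node, and their tighter DMBR.
stored = [("a", "a", "g"), ("a", "c", "g")]
dmbr = [set(column) for column in zip(*stored)]

query, radius = ("t", "a", "c"), 1
print(can_prune(query, radius, subspace))  # False: the query falls inside the subspace
print(can_prune(query, radius, dmbr))      # True: every stored vector is at distance >= 2
```

The query lands in the dead space of the node: inside its subspace, but far from anything stored there. Only the DMBR reveals that.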

[Figure: the partitioned two-dimensional data space over [0, 1] × [0, 1] from the earlier space-partitioning example, annotated with the labels "Actual Minimum Bounding Rectangle", "Subspace is not minimum bounding" and "Dead space", together with a query marked Q and r.]

Tree Construction Algorithms

  • An NSP-tree grows incrementally
      • Vectors are inserted one by one
      • Insertion starts from the root and goes down the tree until a suitable leaf node is found for the new vector
      • The tree grows in a bottom-up fashion
  • There are two important algorithms used in the insertion procedure
      • ChooseSubtree
      • SplitNode

  • ChooseSubtree
      • Starting from the root, it is invoked on non-leaf nodes
      • Given the vector to insert, the algorithm decides which child node to follow based on whether a child’s subspace contains the new vector or not
      • Due to the overlap-free property, there exists at most one child that can contain the new vector
  • SplitNode
      • Splits an overflow node into two nodes
      • The split is guaranteed to be overlap-free
      • It also tries to maximize disk utilization by choosing the most balanced split
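
A rough sketch of the two steps, under stated assumptions. The slides do not give the actual heuristics, so the greedy balancing used in `split_leaf` is a stand-in chosen for this example, not the published SplitNode algorithm. The function names and the dictionary of child subspaces are likewise hypothetical.

```python
from collections import Counter


def contains(subspace, vector):
    """True if `vector` lies in the Cartesian product of the per-dimension sets."""
    return all(value in allowed for value, allowed in zip(vector, subspace))


def choose_subtree(children, vector):
    """Return the name of the single child whose subspace contains `vector`, or None."""
    matches = [name for name, sub in children.items() if contains(sub, vector)]
    if len(matches) > 1:
        raise ValueError("subspaces overlap; the tree is not overlap-free")
    return matches[0] if matches else None


def split_leaf(vectors):
    """Pick the dimension and value grouping that gives the most balanced split.

    Greedy stand-in: on each dimension, assign values (most frequent first)
    to whichever group currently holds fewer vectors.
    """
    best = None
    for dim in range(len(vectors[0])):
        counts = Counter(v[dim] for v in vectors)
        if len(counts) < 2:
            continue                      # one value only: cannot separate here
        groups, sizes = (set(), set()), [0, 0]
        for value, n in counts.most_common():
            i = 0 if sizes[0] <= sizes[1] else 1
            groups[i].add(value)
            sizes[i] += n
        imbalance = abs(sizes[0] - sizes[1])
        if best is None or imbalance < best[0]:
            best = (imbalance, dim, groups)
    if best is None:
        raise ValueError("all vectors are identical; no split exists")
    _, dim, (left, right) = best
    return dim, left, right


alphabet = {"a", "c", "g", "t"}
children = {"left": [{"a"}, alphabet], "right": [{"c", "g", "t"}, alphabet]}
print(choose_subtree(children, ("g", "c")))   # right

overflow = [("a", "c"), ("a", "t"), ("g", "c"), ("t", "c")]
print(split_leaf(overflow))                   # dimension 0, groups {a} and {g, t}
```

Since the two groups partition the values of one dimension, the resulting halves are overlap-free by construction. Balance is only a preference, which matches the absence of a minimum utilization requirement.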

  • There are other algorithms for the NSP-tree
      • Generating and maintaining DMBRs
      • Query
      • Deletion, etc.


Query Performance


Disk Utilization

Summary

  • The NSP-tree is the first indexing method that uses the space-partitioning approach to index vectors with non-ordered discrete components
  • The benefit of using an overlap-free tree structure is obvious when the data distribution is skewed
  • With proper heuristics, the disadvantage of removing the minimum disk utilization requirement can be minimized
  • In general, the benefit of using DMBRs to eliminate dead space (and hence increase the pruning power) outweighs the disadvantage of the reduced fan-out

Future Work

  • Bulk loading the NSP-tree and the ND-tree
      • Insert more than one vector at a time
  • Support approximate similarity queries
      • Beat the curse of high dimensionality
  • Support queries based on the edit distance
      • Besides the Hamming distance, the edit distance is another widely used distance measure for discrete vectors
  • Aggregate all the technology into a viable bioinformatics search engine
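
For reference, the edit distance mentioned in the list above can be computed with the standard dynamic program below (the Levenshtein form, with unit costs for insertion, deletion and substitution). This is the textbook algorithm, not something taken from the slides. The function name is an assumption.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))            # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]                            # turning s[:i] into the empty string
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute, or match for free
        prev = curr
    return prev[-1]


print(edit_distance("acgt", "agt"))       # 1: delete the c
print(edit_distance("kitten", "sitting")) # 3
```

Unlike the Hamming distance, it compares sequences of different lengths, which matters for genome data with insertions and deletions.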
