Multidimensional Indexing: Spatial Data Management & High Dimensional Indexing
Types of Spatial Data
- Point data
  - Points in a multidimensional space
  - E.g., raster data such as satellite imagery, where each pixel stores a measured value
  - E.g., feature vectors extracted from text
- Region data
  - Objects have spatial extent, with a location and a boundary
  - The DB typically uses geometric approximations constructed from line segments, polygons, etc., called vector data
Spatial Indexing
- Point Access Methods (PAMs) vs. Spatial Access Methods (SAMs)
- PAM: index only point data
  - Hierarchical (tree-based) structures
  - Multidimensional hashing
  - Space-filling curves
- SAM: index both points and regions
  - Transformations
  - Overlapping regions
  - Clipping methods (non-overlapping)
- Data partitioning vs. space partitioning
Types of Spatial Queries
- Spatial range queries
  - Find all cities within 50 miles of Troy
  - The query has an associated region (location, boundary)
  - The answer includes overlapping or contained data regions
- Nearest-neighbor queries
  - Find the 10 cities nearest to Troy
  - Results must be ordered by proximity
- Spatial join queries
  - Find all cities near a lake
  - Expensive; the join condition involves regions and proximity
Applications of Spatial Data
- Geographic Information Systems (GIS)
  - E.g., ESRI's ArcInfo; OpenGIS Consortium
  - Geospatial information
  - All classes of spatial queries and data are common
- Computer-Aided Design/Manufacturing
  - Store spatial objects such as the surface of an airplane fuselage
  - Range queries and spatial join queries are common
- Multimedia databases
  - Images, video, text, etc. stored and retrieved by content
  - First converted to feature vector form; high dimensionality
  - Nearest-neighbor queries are the most common
Requirements
- Fast range/window query search (range query)
- Fast similarity search
  - Similarity range query
  - K-nearest neighbour query (KNN query)
High Dimensional Indexing
Feature-Based Similarity Search
[Figure: pipeline with the labels: complex objects, feature extraction and transformation, feature vectors, index construction, index for range/similarity search, similarity queries.]
Retrieval by Colour
- Given a sample image
- Similarity search based on the sample image's colour composition
Query Requirement
- Window/range query: retrieve the data points that fall within a given range along each dimension.
- The index is designed to support range retrieval and to facilitate joins and similarity search (if applicable).
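As a point of reference for the window/range query, here is a minimal sketch of the index-free baseline: scan every point and keep those inside the range on every dimension. The function name `window_query`, the inclusive bounds, and the (age, sal) records are assumptions made for illustration only.

```python
def window_query(points, low, high):
    """Return the points p with low[i] <= p[i] <= high[i] on every dimension i."""
    return [
        p for p in points
        if all(lo <= x <= hi for x, lo, hi in zip(p, low, high))
    ]


# Hypothetical (age, sal in thousands) records.
records = [(51, 85), (53, 95), (40, 82), (54, 88)]

# Window: 50 <= age <= 55 and 80 <= sal <= 90.
hits = window_query(records, low=(50, 80), high=(55, 90))
print(hits)
```

The scan touches every record, which is the cost a multidimensional index tries to avoid.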
Query Requirement
- Similarity queries: similarity range and KNN queries
  - Similarity range query: given a query point, find all data points within a given distance r of the query point.
  - KNN query: given a query point, find the K nearest neighbours, by distance to the point.
[Figure: a query point with radius r, and the distance to the Kth nearest neighbour.]
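The two similarity queries can be stated precisely with a brute-force sketch. It assumes Euclidean distance (the slides do not fix a metric), and the names `similarity_range` and `knn` and the sample points are hypothetical.

```python
import heapq
import math


def similarity_range(points, q, r):
    """All points within distance r of the query point q."""
    return [p for p in points if math.dist(p, q) <= r]


def knn(points, q, k):
    """The k points closest to q, ordered by increasing distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))


points = [(0, 0), (1, 1), (3, 4), (6, 8)]
q = (0, 0)

in_range = similarity_range(points, q, r=5.5)
nearest = knn(points, q, k=2)
print(in_range, nearest)
```

The two queries are related: a KNN query behaves like a similarity range query whose radius is the distance to the Kth nearest neighbour, which is not known in advance.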
Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal.
[Figure: example entries plotted with age (11 to 13) on one axis and sal (10 to 80) on the other, with the B+ tree order traced through them.]
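The effect of linearization can be seen with a few tuples. This is a small sketch, assuming illustrative (age, sal) values chosen to fall in the ranges on the figure axes; they are not taken from the slide. Python compares tuples component by component, which matches the composite-key order of sorting by age and then by sal.

```python
import math

# Illustrative (age, sal) entries; the values are assumptions.
entries = [(12, 20), (11, 80), (13, 75), (12, 10)]

# Composite-key order: sort by age first, then by sal.
btree_order = sorted(entries)
print(btree_order)

# (11, 80) and (13, 75) are close in the 2-dimensional space ...
close_pair = math.dist((11, 80), (13, 75))
# ... yet they sit at opposite ends of the B+ tree order.
gap = btree_order.index((13, 75)) - btree_order.index((11, 80))

# (11, 80) and (12, 10) are adjacent in the order but far apart in space.
adjacent_pair = math.dist((11, 80), (12, 10))
print(close_pair, gap, adjacent_pair)
```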
Multidimensional Indexes
- A multidimensional index clusters entries so as to exploit nearness in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge!
[Figure: example entries grouped by nearness in the 2-dimensional space.]
Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD).
  - Find all hotels within a radius of 5 miles from the conference venue.
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  - Find all cities that lie on the Nile in Egypt.
  - Find all parts that touch the fuselage (in a plane design).
- Similarity queries (content-based retrieval).
  - Given a face, find the five most similar faces.
- Multidimensional range queries.
  - 50 < age < 55 AND 80K < sal < 90K
What's the difficulty?
- An index based on spatial location is needed.
  - One-dimensional indexes don't support multidimensional searching efficiently.
  - Hash indexes only support point queries; we want to support range queries as well.
  - Must support inserts and deletes gracefully.
- Ideally, we want to support non-point data as well (e.g., lines, shapes).
Multi-key Indexes
- Grid files
- Partitioned hash indexes
- kd-trees
- Quad trees
- R-trees
- Bitmap indexes
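Partitioned Hash Function
[Figure: Key1 and Key2 feed hash functions h1 and h2 respectively; the resulting bit strings together form the bucket address.]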
Find Emp. with Dept. = Sales and Sal = 40k
Find Emp. with Sal=30k
Find Emp. with Dept. = Sales
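The three queries above behave differently under a partitioned hash function, and a short sketch makes that concrete. Everything specific here is an assumption for illustration: the bit widths (1 bit for Dept., 2 bits for Sal), the use of `zlib.crc32` as the underlying hash, and the names `bucket_address` and `candidate_buckets`.

```python
import zlib

H1_BITS = 1  # assumed number of address bits contributed by Dept.
H2_BITS = 2  # assumed number of address bits contributed by Sal


def _h(value, bits):
    """Hash a key value down to `bits` bits (deterministic across runs)."""
    return zlib.crc32(str(value).encode()) % (1 << bits)


def bucket_address(dept, sal):
    """Concatenate the h1 bits and the h2 bits into one bucket address."""
    return (_h(dept, H1_BITS) << H2_BITS) | _h(sal, H2_BITS)


def candidate_buckets(dept=None, sal=None):
    """Buckets to inspect when only some of the keys are specified."""
    h1s = [_h(dept, H1_BITS)] if dept is not None else range(1 << H1_BITS)
    h2s = [_h(sal, H2_BITS)] if sal is not None else range(1 << H2_BITS)
    return [(a << H2_BITS) | b for a in h1s for b in h2s]


both = candidate_buckets(dept="Sales", sal=40)   # both keys given
sal_only = candidate_buckets(sal=30)             # only Sal given
dept_only = candidate_buckets(dept="Sales")      # only Dept. given
print(len(both), len(sal_only), len(dept_only))
```

With both keys the search goes to a single bucket; with one key missing, every combination of the unknown bits has to be checked, so the number of candidate buckets grows with the bit width of the unspecified key.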
Grid File
- Hashing method for multidimensional points (an extension of extendible hashing)
- Idea: use a grid to partition the space; each cell is associated with one page
- Two-disk-access principle (exact match)
Grid File
- Start with one bucket for the whole space.
- Select dividers along each dimension.
- Partition the space into cells; dividers cut all the way across.
- Each cell corresponds to 1 disk page.
- Many cells can point to the same page.
- The cell directory is potentially exponential in the number of dimensions.
Grid File Implementation
- Dynamic structure using a grid directory
  - Grid array: a 2-dimensional array with pointers to buckets (this array can be large and disk resident): G(0, ..., nx-1, 0, ..., ny-1)
  - Linear scales: two 1-dimensional arrays that are used to access the grid array (kept in main memory): X(0, ..., nx-1), Y(0, ..., ny-1)
Example
[Figure: linear scale X and linear scale Y index into the grid directory, whose cells point to buckets (disk blocks).]
Grid File Search
- Exact match search: at most 2 I/Os, assuming the linear scales fit in memory.
  - First use the linear scales to determine the index into the cell directory.
  - Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory).
  - Access the appropriate bucket (1 I/O).
- Range queries:
  - Use the linear scales to determine the index into the cell directory.
  - Access the cell directory to retrieve the addresses of the buckets to visit.
  - Access the buckets.
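The search steps can be sketched in memory as follows. This is a simplified 2-dimensional model, not a disk-based implementation: the class name `GridFile`, the use of `bisect_right` on the linear scales (so a point lying on a divider falls into the upper cell), and the sample data are all assumptions.

```python
from bisect import bisect_right


class GridFile:
    def __init__(self, x_scale, y_scale, directory, buckets):
        self.x_scale = x_scale      # linear scale X: sorted divider positions
        self.y_scale = y_scale      # linear scale Y: sorted divider positions
        self.directory = directory  # directory[i][j] -> bucket id
        self.buckets = buckets      # bucket id -> list of (x, y) points

    def cell(self, x, y):
        """Use the linear scales to find the index into the cell directory."""
        return bisect_right(self.x_scale, x), bisect_right(self.y_scale, y)

    def exact_match(self, x, y):
        i, j = self.cell(x, y)
        bucket_id = self.directory[i][j]        # directory access
        return (x, y) in self.buckets[bucket_id]  # bucket access

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        i_lo, j_lo = self.cell(x_lo, y_lo)
        i_hi, j_hi = self.cell(x_hi, y_hi)
        # Several cells may share a bucket, so collect distinct bucket ids.
        bucket_ids = {
            self.directory[i][j]
            for i in range(i_lo, i_hi + 1)
            for j in range(j_lo, j_hi + 1)
        }
        return sorted(
            p for b in bucket_ids for p in self.buckets[b]
            if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi
        )


# One divider per dimension gives a 2 x 2 grid; two cells share bucket 2.
grid = GridFile(
    x_scale=[50],
    y_scale=[50],
    directory=[[0, 1], [2, 2]],
    buckets={0: [(10, 10), (20, 40)], 1: [(10, 60)], 2: [(70, 20), (80, 90)]},
)
found = grid.exact_match(70, 20)
window = grid.range_query(0, 75, 0, 50)
print(found, window)
```

Because the directory maps cells to bucket ids rather than holding the points, an exact match needs only the directory lookup and one bucket read, which is the two-access bound stated above.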
Grid File Insertions
- Determine the bucket into which the insertion must occur.
- If there is space in the bucket, insert.
- Else, split the bucket.
  - How to choose a good dimension to split?
- If a bucket split causes the cell directory to split, do so and adjust the linear scales.
- Inserting the new directory entries potentially requires a complete reorganization of the cell directory, which is expensive!
Grid File Deletions
- Deletions may decrease the space utilization; merge buckets.
- We need to decide which cells to merge and a merging threshold.
- Buddy system and neighbor system
  - Buddy system: a bucket can merge with only one buddy in each dimension.
  - Neighbor system: merge adjacent regions if the result is a rectangle.
Grid File Example (N=6)
[Figures: a sequence of grid file snapshots as numbered points are inserted.]
Kd-Trees
- Binary partitioning of space.
- Internal nodes: a split of the form a < V and a >= V for some attribute a.
- The dimension to split on alternates among all dimensions.
- A split doesn't have to span the whole dimension (unlike grid files).
- Leaves are blocks that hold the points.
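The structure described above can be sketched as a small in-memory tree. This is one possible construction under stated assumptions: the split value V is taken as the median coordinate, leaves hold at most `capacity` points (a stand-in for block size), and the names `Node`, `build`, and `range_search` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    dim: int = 0                       # attribute tested at an internal node
    value: float = 0.0                 # split value V
    left: Optional["Node"] = None      # points with a < V
    right: Optional["Node"] = None     # points with a >= V
    points: Optional[List[Tuple]] = None  # set only on leaves (a "block")


def build(points, depth=0, capacity=2):
    if len(points) <= capacity:
        return Node(points=list(points))
    dim = depth % len(points[0])       # alternate the split dimension
    ordered = sorted(points, key=lambda p: p[dim])
    value = ordered[len(ordered) // 2][dim]
    left = [p for p in ordered if p[dim] < value]
    right = [p for p in ordered if p[dim] >= value]
    if not left:                       # no point is below V: keep one leaf
        return Node(points=ordered)
    return Node(dim, value,
                build(left, depth + 1, capacity),
                build(right, depth + 1, capacity))


def range_search(node, low, high):
    if node.points is not None:        # leaf: check each point in the block
        return [p for p in node.points
                if all(l <= x <= h for x, l, h in zip(p, low, high))]
    found = []
    if low[node.dim] < node.value:     # window reaches into the a < V side
        found += range_search(node.left, low, high)
    if high[node.dim] >= node.value:   # window reaches into the a >= V side
        found += range_search(node.right, low, high)
    return found


pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
result = sorted(range_search(tree, low=(3, 1), high=(8, 5)))
print(result)
```

Each split only divides the region of its own subtree, which is the sense in which a kd-tree split need not span the whole dimension.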
Kd-Trees Example (N=6)
kDB Trees Example
The R-Tree
- The R-tree is a tree-structured index that remains balanced on inserts and deletes.
- Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.
- Example in 2-D:
R-Tree Properties
- Leaf entry = < n-dimensional box, rid >
  - The key value is a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = < n-dim box, ptr to child node >
  - The box covers all boxes in the child node (in fact, in the subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is
Example of an R-Tree
[Figure: spatial objects approximated by bounding boxes (e.g., R8); leaf entries R8 through R19 are grouped under index entries R1 through R7.]
Example R-Tree (Contd.)
[Figure: the corresponding tree, with index entries R1 through R7 above leaf entries R8 through R19.]
Search for Objects Overlapping Box Q
Start at root.
1. If the current node is a non-leaf, for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)
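The two search steps translate almost directly into a recursive function. The sketch below is an in-memory illustration with assumed names (`Node`, `overlaps`, `search`) and an assumed box representation of one (low, high) interval per dimension; boxes that merely touch are counted as overlapping.

```python
from dataclasses import dataclass, field


def overlaps(a, b):
    """Two boxes overlap if their intervals intersect on every dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


@dataclass
class Node:
    is_leaf: bool
    # Leaf entries are (box, rid); non-leaf entries are (box, child Node).
    entries: list = field(default_factory=list)


def search(node, q):
    hits = []
    for box, ref in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                hits.append(ref)               # rid of a candidate object
            else:
                hits.extend(search(ref, q))    # descend into the subtree
    return hits


leaf1 = Node(True, [(((0, 2), (0, 2)), "a"), (((3, 5), (1, 4)), "b")])
leaf2 = Node(True, [(((6, 8), (6, 9)), "c"), (((4, 7), (5, 6)), "d")])
root = Node(False, [(((0, 5), (0, 4)), leaf1), (((4, 8), (5, 9)), leaf2)])

q = ((4, 6), (3, 5))
candidates = search(root, q)
print(candidates)
```

In this example the query box overlaps both root entries, so both subtrees are visited. The returned rids are only candidates, since the stored boxes approximate the underlying objects.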
Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than for bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.
Insert Entry <B, ptr>
- Start at root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
- If best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) fo