Indexing Multidimensional Feature Spaces
- Overview of multidimensional index structures
- Hybrid Tree, Chakrabarti et al., ICDE 1999
- Local Dimensionality Reduction, Chakrabarti et al., VLDB 2000
Queries over Feature Spaces
- Consider a d-dimensional feature space
  - e.g., color histogram, texture
- Nature of queries
  - Range queries: objects that reside within the region specified in the query
  - K-nearest neighbor queries: the K objects closest to a query object under a distance metric
  - Approximate nearest neighbor queries: the retrieved object is within a factor of (1 + epsilon) of the distance to the true nearest neighbor
  - All-pair (similarity join) queries: retrieve all pairs of objects within an epsilon distance threshold of each other
- A search algorithm may produce:
  - False positives: objects that do not meet the query condition but are retrieved anyway. The goal is to minimize them.
  - False negatives: objects that meet the query condition but are not returned. Most approaches avoid false negatives altogether.
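
As a baseline for the query types above, range and K-nearest neighbor queries can be answered by a linear scan with no index at all. This is only a sketch: the function names, the sample points, and the choice of Euclidean distance are assumptions made for illustration.

```python
import heapq
import math

def range_query(points, low, high):
    """Return the points p with low[i] <= p[i] <= high[i] in every dimension."""
    return [p for p in points
            if all(l <= x <= h for x, l, h in zip(p, low, high))]

def knn_query(points, q, k):
    """Return the k points closest to q under Euclidean distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Hypothetical 2-dimensional feature vectors.
points = [(1, 1), (2, 5), (4, 4), (9, 9)]
in_box = range_query(points, low=(0, 0), high=(4, 4))
nearest = knn_query(points, q=(3, 3), k=2)
```

A scan like this returns neither false positives nor false negatives, but it reads every object; the index structures that follow try to avoid that cost.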
Approach: Utilize Single Dimensional Index
- Index each attribute independently
- Project the query range onto each attribute to determine the matching pointers
- Intersect the pointer sets
- Go to the database and retrieve the objects in the intersection
- May result in very high I/O cost
Multiple Key Index
- The index on one attribute provides pointers to an index on the other
  - (figure labels: index on first attribute, index on second attribute)
- Cannot support partial match queries on the second attribute
- Performance of range search is not much better than with the independent attribute approach
- The secondary indices may be of different sizes; specifically, some of them may be very small
R-tree Data Structure
- Extension of the B-tree to multidimensional space
- Paginated, balanced, guaranteed storage utilization
- Can support both point data and data with spatial extent
- Groups objects into possibly overlapping clusters (rectangles in our case)
- Search for a range query proceeds along all paths that overlap with the query
R-tree Insert Object E
- Step I1: ChooseLeaf L to insert E /* find position to insert */
- Step I2: If L has room, install E; else SplitNode(L)
- Step I3: AdjustTree /* propagate changes */
- Step I4: If the node split propagates to the root, adjust the height of the tree
ChooseLeaf
- Step CL1: Set N to be the root
- Step CL2: If N is a leaf, return N
- Step CL3: If N is not a leaf, let F be the entry in N whose rectangle needs the least enlargement to include the object
  - break ties by choosing the smaller rectangle
- Step CL4: Set N to be the child node pointed to by entry F; go to Step CL2
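
The ChooseLeaf descent can be sketched as follows. The `Node` class, the `(xmin, ymin, xmax, ymax)` rectangle layout, and the helper names are assumptions for illustration only, not part of the original slides.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Smallest rectangle covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(r, obj):
    """Extra area r needs in order to include obj."""
    return area(union(r, obj)) - area(r)

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries  # internal nodes: list of (rect, child_node)

def choose_leaf(root, obj_rect):
    n = root                                   # CL1
    while not n.is_leaf:                       # CL2
        # CL3: least enlargement, ties broken by smaller area
        _, child = min(n.entries,
                       key=lambda e: (enlargement(e[0], obj_rect), area(e[0])))
        n = child                              # CL4
    return n

leaf_a, leaf_b, leaf_c = Node(True, []), Node(True, []), Node(True, [])
root = Node(False, [((0, 0, 4, 4), leaf_a),
                    ((6, 6, 10, 10), leaf_b),
                    ((0, 0, 2, 2), leaf_c)])
```

Here an object at `(7, 7, 8, 8)` goes to `leaf_b` (no enlargement needed), while an object at `(1, 1, 1, 1)` fits in both `leaf_a` and `leaf_c` without enlargement and the tie goes to the smaller rectangle, `leaf_c`.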
Split Node
- Given a node, split it into two nodes that are each at least half full
- Multiple objectives:
  - minimize overlap
  - minimize covered area
- The R-tree minimizes covered area
- What is the optimal criterion: minimizing overlap or minimizing covered area?
Minimizing Covered Area
- Group objects into 2 parts such that the covered area is minimized
  - NP-hard!
  - Hence use heuristics
- Two heuristics explored: quadratic and linear
Basic Split Strategy
/* Divide the set of M+1 entries into 2 groups G1 and G2 */
- PickSeeds for G1 and G2
- Invoke PickNext repeatedly to assign an object to a group, until either all objects are assigned or one of the groups becomes half full
- If one group becomes half full, assign the rest of the objects to the other group
Quadratic Split
- PickSeeds:
  - For each pair of entries E1 and E2, compose a rectangle J including E1.rect and E2.rect
  - Let d = area(J) - area(E1.rect) - area(E2.rect) /* d is wasted space */
  - Choose the most wasteful pair (largest d) as seeds for groups G1 and G2
- PickNext: /* select next entry to put in a group */
  - Determine the cost of putting each entry in groups G1 and G2
  - For each unassigned entry, calculate
    - d1 = area increase required in the covering rectangle of group G1 to include the entry
    - d2 = area increase required in the covering rectangle of group G2 to include the entry
  - Select the entry with the greatest preference for one group: choose any entry with the maximum difference between d1 and d2
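
A minimal sketch of the two quadratic-split steps, assuming 2-dimensional rectangles stored as `(xmin, ymin, xmax, ymax)` tuples. The function names mirror the slide, but the signatures and the sample rectangles are hypothetical.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pick_seeds(rects):
    """Return the index pair (i, j) with the largest wasted area d."""
    def waste(pair):
        i, j = pair
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

def pick_next(rects, unassigned, cover1, cover2):
    """Return the unassigned index with the largest |d1 - d2|."""
    def preference(i):
        d1 = area(union(cover1, rects[i])) - area(cover1)
        d2 = area(union(cover2, rects[i])) - area(cover2)
        return abs(d1 - d2)
    return max(unassigned, key=preference)

rects = [(0, 0, 1, 1), (1, 1, 2, 2), (8, 8, 9, 9), (4, 4, 5, 5)]
seeds = pick_seeds(rects)
nxt = pick_next(rects, [1, 3], rects[seeds[0]], rects[seeds[1]])
```

PickSeeds examines every pair of entries, which is where the quadratic cost comes from. In this example the far-apart rectangles 0 and 2 become the seeds, and rectangle 1 is assigned next because it strongly prefers the group seeded by rectangle 0.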
Linear Split
- PickSeeds:
  - Find the extreme rectangles along each dimension: the entry with the highest low side and the entry with the lowest high side
  - Record the separation
  - Normalize the separation by the width of the extent along the dimension
  - Choose as seeds the pair with the greatest normalized separation along any dimension
- PickNext:
  - Randomly choose an entry to assign
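
The linear PickSeeds step might look like the sketch below. It assumes rectangles are stored as tuples of all low coordinates followed by all high coordinates, that the entry set has nonzero width in every dimension, and that the two extreme entries are distinct; the function name and sample data are hypothetical.

```python
def linear_pick_seeds(rects, dims=2):
    """Return (i, j): the pair with the greatest normalized separation."""
    best = None
    for d in range(dims):
        low, high = d, d + dims  # positions of the low and high side for dimension d
        i_high_low = max(range(len(rects)), key=lambda i: rects[i][low])
        i_low_high = min(range(len(rects)), key=lambda i: rects[i][high])
        # Width of the whole entry set along this dimension.
        width = max(r[high] for r in rects) - min(r[low] for r in rects)
        separation = (rects[i_high_low][low] - rects[i_low_high][high]) / width
        if best is None or separation > best[0]:
            best = (separation, i_low_high, i_high_low)
    return best[1], best[2]

rects = [(0, 0, 1, 1), (1, 1, 2, 2), (8, 0, 9, 3), (4, 4, 5, 5)]
seeds = linear_pick_seeds(rects)
```

Each dimension is scanned once, so the cost is linear in the number of entries. For the sample data the x-dimension gives separation 7/9 and the y-dimension 3/5, so rectangles 0 and 2 are chosen.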
R-tree Search (Range Search on range S)
- Start from the root
- If node T is not a leaf
  - check each entry E in T to determine whether E.rectangle overlaps S
  - for all overlapping entries, invoke search recursively
- If T is a leaf
  - check each entry to see whether it satisfies the range query
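
The recursive range search can be sketched as below. The `Node` layout, the rectangle format `(xmin, ymin, xmax, ymax)`, and the tiny hand-built tree are assumptions for illustration; "satisfies the range query" is taken here to mean that the entry's rectangle overlaps S.

```python
def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        # leaf: list of (rect, object_id); internal: list of (rect, child_node)
        self.entries = entries

def search(node, s):
    """Return the ids of all objects whose rectangle overlaps the range s."""
    results = []
    for rect, item in node.entries:
        if overlaps(rect, s):
            if node.is_leaf:
                results.append(item)
            else:
                results.extend(search(item, s))  # follow every overlapping path
    return results

leaf1 = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf2 = Node(True, [((6, 6, 7, 7), "c"), ((2, 5, 8, 9), "d")])
root = Node(False, [((0, 0, 3, 3), leaf1), ((2, 5, 8, 9), leaf2)])
hits = search(root, (2.5, 2.5, 6.5, 6.5))
```

Because bounding rectangles may overlap, the query above has to descend into both children of the root.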
R-tree Delete
- Step D1: Find the object and delete its entry
- Step D2: CondenseTree
- Step D3: If the root has only one child, shorten the tree height
Condense Tree
- If a node is underfull
  - delete its entry from the parent and add the node to a set Q
- Adjust the bounding rectangle of the parent
- Do the above recursively for all levels
- Reinsert all the orphaned entries; insert entries at the same level from which they were deleted
Other Multidimensional Data Structures
- Many generalizations of the R-tree
  - different splitting criteria
  - different shapes of clusters (e.g., d-dimensional spheres)
  - adding redundancy to reduce search cost: store objects in multiple rectangles instead of a single rectangle to reduce the cost of retrieval. But now insert has to store objects in many clusters. This strategy also increases overlap, causing search performance to deteriorate.
- Space partitioning data structures
  - unlike the R-tree, which groups objects into possibly overlapping clusters, these methods attempt to partition space into non-overlapping regions
  - e.g., KD-tree, quad tree, grid files, KD-Btree, HB-tree, hybrid tree
- Space filling curves
  - superimpose an ordering on multidimensional space that tends to preserve proximity in multidimensional space (Z-ordering, Hilbert ordering)
  - use a B-tree as an index on that ordering
KD-tree
- A main memory data structure based on binary search trees
  - can be adapted to the block model of storage (KD-Btree)
- Levels rotate among the dimensions, partitioning the space based on a value for that dimension
- A KD-tree is not necessarily balanced
KD-Tree Operations
- Search: straightforward. Descend the tree as in a binary search tree.
- Insertion: look up the record to be inserted, reaching the appropriate leaf
  - If there is room in the leaf, insert into the leaf block
  - Else, find a suitable value for the appropriate dimension and split the leaf block
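
A small sketch of these two operations, assuming leaves are blocks that hold up to `capacity` points, the split dimension rotates with depth, and the median coordinate is used as the "suitable value". All of these choices, and the class and function names, are illustrative assumptions. The sketch does not handle blocks whose points all share the same coordinate in the split dimension.

```python
class KDNode:
    def __init__(self, points=None):
        self.points = points if points is not None else []  # leaf block contents
        self.dim = None     # split dimension (set on internal nodes only)
        self.value = None   # split value
        self.left = self.right = None

    def is_leaf(self):
        return self.dim is None

def find_leaf(node, p):
    """Descend as in a binary search tree; return the leaf and its depth."""
    depth = 0
    while not node.is_leaf():
        node = node.left if p[node.dim] < node.value else node.right
        depth += 1
    return node, depth

def insert(root, p, capacity=2):
    leaf, depth = find_leaf(root, p)
    leaf.points.append(p)
    if len(leaf.points) > capacity:          # no room: split the leaf block
        dim = depth % len(p)                 # dimensions rotate by level
        values = sorted(q[dim] for q in leaf.points)
        value = values[len(values) // 2]     # median as the split value
        leaf.left = KDNode([q for q in leaf.points if q[dim] < value])
        leaf.right = KDNode([q for q in leaf.points if q[dim] >= value])
        leaf.dim, leaf.value, leaf.points = dim, value, []

def contains(root, p):
    leaf, _ = find_leaf(root, p)
    return p in leaf.points

root = KDNode()
for pt in [(2, 3), (5, 4), (9, 6), (7, 2)]:
    insert(root, pt)
```

After these inserts the root splits on x at 5 and its right child splits on y at 4. Nothing rebalances the tree, which is why a KD-tree built this way is not necessarily balanced.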
Adapting KD Tree to Block Model
- Similar to a B-tree: tree nodes split many ways instead of two ways
  - Risk: insertion becomes quite complex and expensive. No storage utilization guarantee, since when a higher-level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.
- Pack many interior nodes (forming a subtree) into a block
  - Risk: it may not be feasible to group nodes at lower levels into a block productively
  - Many interesting papers on how to optimally pack nodes into blocks have recently been published
Quad Tree
- Nodes split along all dimensions simultaneously
- Division is fixed: by quadrants
- As with the KD-tree, we cannot make quad tree levels uniform
Quad Tree Example
[Figure: quad tree example with labels X=5, X=8, X=7, X=3 and quadrants SW, SE, NE, NW]
Quad Tree Operations
- Insert:
  - Find the leaf node to which the point belongs
  - If there is room, put it there
  - Else, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves.
- Search:
  - straightforward: just descend the appropriate subtree
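
The insert and search steps can be sketched for the 2-dimensional case as follows. The `QuadNode` class, the leaf capacity, and the convention that points on a dividing line go to the north or east side are assumptions for illustration. The sketch also assumes all inserted points are distinct, since duplicates beyond the capacity would split forever.

```python
class QuadNode:
    def __init__(self, x0, y0, x1, y1, capacity=1):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None  # dict keyed by "SW", "SE", "NW", "NE" once split

    def quadrant(self, p):
        x0, y0, x1, y1 = self.bounds
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        return ("N" if p[1] >= cy else "S") + ("E" if p[0] >= cx else "W")

    def insert(self, p):
        if self.children is not None:          # interior node: descend
            self.children[self.quadrant(p)].insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:   # no room: split by quadrants
            self.split()

    def split(self):
        x0, y0, x1, y1 = self.bounds
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        c = self.capacity
        self.children = {"SW": QuadNode(x0, y0, cx, cy, c),
                         "SE": QuadNode(cx, y0, x1, cy, c),
                         "NW": QuadNode(x0, cy, cx, y1, c),
                         "NE": QuadNode(cx, cy, x1, y1, c)}
        points, self.points = self.points, []
        for q in points:                       # redistribute among new leaves
            self.children[self.quadrant(q)].insert(q)

    def contains(self, p):
        if self.children is not None:
            return self.children[self.quadrant(p)].contains(p)
        return p in self.points

root = QuadNode(0, 0, 8, 8)
for pt in [(1, 1), (6, 6), (7, 7)]:
    root.insert(pt)
```

The two nearby points (6, 6) and (7, 7) force repeated splits in the NE quadrant while the SW quadrant stays one level deep, which illustrates why quad tree levels cannot be kept uniform.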
Grid Files
- Space partitioning strategy, but different from a tree
- Select dividers along each dimension. Partition space into cells.
- Unlike in a KD-tree, dividers cut all the way across the space
- Each cell corresponds to 1 disk page
- Many cells can point to the same page
- Cell directory is potentially exponential in the number of dimensions
Grid File Implementation
- Maintain linear scales for each dimension that contain the split positions for that dimension
- Cell directory implemented as a multidimensional array /* can be large and may not fit in memory */
Grid File Search
- Exact match search: at most 2 I/Os, assuming linear scales fit in memory
  - First use linear scales to determine the index into the cell directory
  - Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory)
  - Access the appropriate bucket (1 I/O)
- Range queries:
  - Use linear scales to determine the indices into the cell directory
  - Access the cell directory to retrieve the addresses of the buckets to visit
  - Access the buckets
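
An exact match lookup can be sketched with in-memory stand-ins for the linear scales, the cell directory, and the buckets. The split positions, bucket names, and contents below are made up for illustration; in a real grid file the directory and buckets would live on disk pages.

```python
from bisect import bisect_right

# Linear scales: split positions per dimension (hypothetical values).
scales = [[10, 20],   # x is divided into [.., 10), [10, 20), [20, ..)
          [5]]        # y is divided into [.., 5), [5, ..)

# Cell directory as a 2-dimensional array; several cells may share a bucket.
directory = [["b0", "b1"],
             ["b0", "b2"],
             ["b3", "b3"]]

buckets = {"b0": [(3, 2), (12, 1)],
           "b1": [(4, 9)],
           "b2": [(15, 7)],
           "b3": [(25, 3), (30, 8)]}

def cell_index(point):
    """Use the linear scales to find the directory cell of a point."""
    return tuple(bisect_right(s, x) for s, x in zip(scales, point))

def exact_match(point):
    i, j = cell_index(point)
    bucket_id = directory[i][j]        # directory access (possibly 1 I/O)
    return point in buckets[bucket_id]  # bucket access (1 I/O)
```

For example, `cell_index((15, 7))` is `(1, 1)`, which maps to bucket `b2`. Cells `(0, 0)` and `(1, 0)` both point to `b0`, showing how many cells can share one page.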
Grid File Insert
- Determine the bucket into which the insertion must occur
- If there is space in the bucket, insert
- Else, split the bucket
  - how to choose a good dimension to split?
- If the bucket split causes the cell directory to split, do so and adjust the linear scales
  /* notice that a cell directory split results in p^(d-1) new entries being created in the cell directory */
- Insertion of these new entries potentially requires a complete reorganization of the cell directory: expensive!
Grid File Insert (continued)
- Inserting a new split position requires the cell directory to grow by 1 column
- In d-dimensional space, this causes p^(d-1) new entries to be created
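
The p^(d-1) figure can be checked with a short calculation, under the assumption (not stated on the slide) that p is the number of intervals along each dimension, so that the directory holds p^d cells before the split. The values p = 10 and d = 5 are chosen only as an example.

```latex
\underbrace{(p+1)\,p^{\,d-1}}_{\text{cells after the split}}
\;-\;
\underbrace{p^{\,d}}_{\text{cells before the split}}
\;=\; p^{\,d-1},
\qquad
\text{e.g. } p = 10,\ d = 5:\quad 10^{4} = 10{,}000 \text{ new entries.}
```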
Space Filling Curve
- Assumption: finite precision in representing each coordinate
[Figure: grid with both axes labeled 00, 01, 10, 11, containing objects A, B, and C]
- Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5
- Z(B) = 11 = 3 (common prefix to all its blocks)
- Z(C1) = 0010 = 2
- Z(C2) = 1000 = 8
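
The shuffle operation interleaves the bits of the two coordinates, x bit first. A minimal sketch, assuming coordinates are given as equal-length bit strings (the function names are hypothetical apart from `shuffle`, which follows the slide):

```python
def shuffle(x_bits, y_bits):
    """Interleave two equal-length bit strings: x1 y1 x2 y2 ..."""
    return "".join(a + b for a, b in zip(x_bits, y_bits))

def z_value(x_bits, y_bits):
    """Z-value of a cell as an integer."""
    return int(shuffle(x_bits, y_bits), 2)

z_a = z_value("00", "11")
```

With x = 00 and y = 11 this reproduces the slide's value for A: the shuffled string is 0101, which is 5.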
Deriving Z-Values for a Region
- Obtain a quad-tree decomposition of an object by recursively dividing it into blocks until the blocks are homogeneous
[Figure: quad-tree decomposition of an object into blocks labeled with their Z-values]
- The object's representation is 0001, 0011, 01
Disk Based Storage
- For disk storage, represent each object by its Z-value
- Use a B-tree index
- Range query:
  - translate the query range to Z-values
  - search the B-tree with the Z-values of data regions for matches
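
A simplified sketch of a Z-value range query over point data. A sorted Python list stands in for the B-tree, and the query box is translated to a single Z-interval (from its lower-left to its upper-right corner) rather than to a set of quad-tree blocks; both are simplifications chosen for brevity, and the names and sample points are hypothetical. The single interval can contain points outside the box, so candidates are filtered afterwards.

```python
from bisect import bisect_left, bisect_right

def z_value(x, y, bits=2):
    """Interleave the bits of x and y (x bit first) into one integer."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

points = [(0, 3), (1, 0), (2, 0), (2, 2), (3, 3)]
index = sorted((z_value(x, y), (x, y)) for x, y in points)  # B-tree stand-in
keys = [z for z, _ in index]

def range_query(x0, y0, x1, y1):
    # Every point in the box has a Z-value between those of the two corners.
    lo = bisect_left(keys, z_value(x0, y0))
    hi = bisect_right(keys, z_value(x1, y1))
    # Drop false positives that fall in the Z-interval but outside the box.
    return [p for _, p in index[lo:hi]
            if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]

result = range_query(1, 0, 2, 2)
```

In this example the Z-interval [2, 12] also covers the point (0, 3) with Z-value 5, which lies outside the box and is removed by the filter.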
Nearest Neighbor Search
- Retrieve the nearest neighbor of query poin