

Multidimensional Data Structures

• Why do we need indexing?
  – Quick access

• Search methods
  – sequential search
  – binary search
  – balanced search tree (B or B+ tree)

• Why do we need new indexing structures?
  – Traditional data is one-dimensional
  – Multimedia data is multidimensional

• In general, if an item of information has k features, it can be represented as a point in a k-dimensional space, where each dimension corresponds to one feature


What kind of queries may be expected?

• Given a set of points in k-dimensional space
  – we may want to see whether a given point is in the set (exact match)
  – we may want to find the points closest to a given point (similarity-based search, or nearest-neighbor queries)
  – given a region, we may want to find all the points in that region (range query)

• Approach
  – divide the space into regions
  – insert each new object into the corresponding region
  – if the region is full, split the region

• Query
  – determine which regions are required to answer the query, and limit the search to these regions
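
As a point of reference for the three query types, here is a minimal brute-force sketch that scans every point. It is only a baseline, not one of the index structures discussed in these slides; the function names and the sample coordinates (taken from the city table used later in the deck) are illustrative assumptions.

```python
import math

def exact_match(points, q):
    """Exact match: is q one of the stored points?"""
    return q in points

def nearest(points, q):
    """Nearest-neighbor query: the stored point closest to q."""
    return min(points, key=lambda p: math.dist(p, q))

def range_query(points, center, r):
    """Range query: all stored points within distance r of center."""
    return [p for p in points if math.dist(p, center) <= r]

# Sample 2-dimensional data (assumed for illustration).
points = [(19, 45), (40, 50), (38, 38), (54, 35), (4, 4)]
found = exact_match(points, (38, 38))
closest = nearest(points, (35, 46))
in_range = range_query(points, (35, 46), 9.5)
```

Each call inspects all points; the structures that follow aim to restrict the search to the regions that can actually contribute to the answer.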


Multidimensional Data Structures

• K-d Trees

• Point Quadtrees

• MX-Quadtrees

• R-Trees

• Many others exist
  – We do not discuss them in this class


K-d Trees

• Used to store k-dimensional point data

• Not used to store region data

• A 2-d tree (k=2) stores 2-dimensional point data, a 3-d tree stores 3-dimensional point data, and so on


2-d trees

• Node Structure

    nodetype = record
        INFO: infotype;
        XVAL: real;
        YVAL: real;
        LLINK: nodetype;
        RLINK: nodetype;
    end

• The INFO field is any user-defined type
• XVAL and YVAL denote the coordinates of the point associated with the node
• The LLINK and RLINK fields point to the node's two children

[Figure: node layout with the fields INFO, XVAL, YVAL in the top row and LLINK, RLINK in the bottom row]


2-d Trees

• A 2-d tree is a binary tree satisfying the following properties
  – If N is a node in the tree such that level(N) is even, then
    • every node M in the subtree rooted at N.LLINK satisfies M.XVAL < N.XVAL
    • every node P in the subtree rooted at N.RLINK satisfies P.XVAL ≥ N.XVAL
  – If N is a node in the tree such that level(N) is odd, then
    • every node M in the subtree rooted at N.LLINK satisfies M.YVAL < N.YVAL
    • every node P in the subtree rooted at N.RLINK satisfies P.YVAL ≥ N.YVAL


2-d Trees: Insertion/Search

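[Figure: an example 2-d tree on attributes A1 and A2, with node (30,50) branching on A1 < 30 / A1 >= 30, node (60,10) branching on A2 < 10 / A2 >= 10, and node (45,20) branching on A1 < 45 / A1 >= 45; alongside, the points (30,50), (45,20) and (60,10) plotted on A1/A2 axes marked at 20, 40, 60]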

• To insert N into the tree pointed to by T
  – Check whether N and T agree on their XVAL and YVAL
  – If so, just overwrite T and we are done
  – Else branch left if N.XVAL < T.XVAL and branch right otherwise
  – Suppose P is the child. If N and P agree on their XVAL and YVAL, overwrite P and we are done; else branch left if N.YVAL < P.YVAL and branch right otherwise
  – Repeat this procedure, branching on XVALs at even levels and on YVALs at odd levels of the tree
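
The insertion procedure above can be sketched in Python as follows. This is an illustrative translation of the record structure and the even/odd branching rule, not code from the slides; the lowercase field names, the `insert`/`search` helpers, and the choice to insert the sample cities in their listed order are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    info: Any
    xval: float
    yval: float
    llink: Optional["Node"] = None
    rlink: Optional["Node"] = None

def insert(root, info, x, y, level=0):
    """Insert (x, y) below root and return the (possibly new) subtree root."""
    if root is None:
        return Node(info, x, y)
    if (root.xval, root.yval) == (x, y):
        root.info = info              # same coordinates: overwrite
        return root
    # Even levels discriminate on XVAL, odd levels on YVAL.
    go_left = x < root.xval if level % 2 == 0 else y < root.yval
    if go_left:
        root.llink = insert(root.llink, info, x, y, level + 1)
    else:
        root.rlink = insert(root.rlink, info, x, y, level + 1)
    return root

def search(root, x, y):
    """Exact-match search using the same branching rule as insertion."""
    level = 0
    while root is not None:
        if (root.xval, root.yval) == (x, y):
            return root
        go_left = x < root.xval if level % 2 == 0 else y < root.yval
        root = root.llink if go_left else root.rlink
        level += 1
    return None

# Sample data; the insertion order is an assumption.
cities = [("Banja Luka", 19, 45), ("Derventa", 40, 50), ("Toslic", 38, 38),
          ("Tuzla", 54, 35), ("Sinj", 4, 4)]
root = None
for name, x, y in cities:
    root = insert(root, name, x, y)
```

Because the shape of the tree depends on insertion order, a different order of the same five points would generally produce a different tree.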


Another example of 2-d tree

City          (XVAL,YVAL)
Banja Luka    (19,45)
Derventa      (40,50)
Toslic        (38,38)
Tuzla         (54,35)
Sinj          (4,4)

Example of Insertion

[Figures: step-by-step insertion of the cities from the table above into a 2-d tree]


Deletion in 2-d Trees

• Suppose we wish to delete (x,y) from a 2-d tree T
  – Search for a node N in T with N.XVAL = x and N.YVAL = y
  – If N is a leaf node, set the appropriate field (LLINK or RLINK) of N's parent to NIL and return N to available storage
  – Otherwise, either the subtree rooted at N.LLINK (Tl) or the subtree rooted at N.RLINK (Tr) is non-empty
    • Step 1: Find a "candidate replacement" node R that occurs in either Tl or Tr
    • Step 2: Replace all of N's non-link fields by those of R
    • Step 3: Recursively delete R from Tl or Tr (whichever is applicable)
  – The recursion is guaranteed to terminate because Tl (Tr) has strictly smaller height than the original tree T


Finding Candidate Replacement Node

• The candidate replacement node R must bear the same spatial relation to all nodes P in both Tl and Tr that N bore to P
• I.e., if P is to the southwest of N, then P must be to the southwest of R; if P is to the northwest of N, then P must be to the northwest of R; and so on
• This means R must satisfy the following properties:
  – 1. every node M in Tl is such that M.XVAL < R.XVAL if level(N) is even, and M.YVAL < R.YVAL if level(N) is odd
  – 2. every node M in Tr is such that M.XVAL ≥ R.XVAL if level(N) is even, and M.YVAL ≥ R.YVAL if level(N) is odd


Finding Candidate Replacement Node

• If Tr is not empty and level(N) is even, then any node in Tr with the smallest XVAL in Tr is a candidate replacement node (if level(N) is odd, use the smallest YVAL instead)
• If Tr is empty, then we might not be able to find a candidate replacement node in Tl
• In this case, find the node R' in Tl with the smallest XVAL field (YVAL if level(N) is odd), and replace N's non-link fields with those of R'
• Set N.RLINK = N.LLINK and set N.LLINK = NIL
• Recursively delete R' from the subtree now rooted at N.RLINK
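
The deletion steps, including the case where Tr is empty, might be coded roughly as below. This is a sketch under the assumptions that coordinates are unique and that the tree was built with the even/odd branching rule; `key`, `find_min` and `delete` are hypothetical helper names, and the sample insertion order is assumed.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    info: Any
    xval: float
    yval: float
    llink: Optional["Node"] = None
    rlink: Optional["Node"] = None

def key(node, dim):
    """Coordinate of node along dimension dim (0 = XVAL, 1 = YVAL)."""
    return node.xval if dim == 0 else node.yval

def insert(root, info, x, y, level=0):
    if root is None:
        return Node(info, x, y)
    if (root.xval, root.yval) == (x, y):
        root.info = info
    elif (x, y)[level % 2] < key(root, level % 2):
        root.llink = insert(root.llink, info, x, y, level + 1)
    else:
        root.rlink = insert(root.rlink, info, x, y, level + 1)
    return root

def find_min(root, dim, level):
    """A node with the smallest coordinate along dim in this subtree."""
    if root is None:
        return None
    candidates = [root, find_min(root.llink, dim, level + 1)]
    # If this level splits on dim, everything to the right is >= root,
    # so the right subtree cannot hold a strictly smaller value.
    if level % 2 != dim:
        candidates.append(find_min(root.rlink, dim, level + 1))
    return min((c for c in candidates if c is not None),
               key=lambda c: key(c, dim))

def delete(root, x, y, level=0):
    """Delete (x, y) and return the new subtree root."""
    if root is None:
        return None
    dim = level % 2
    if (root.xval, root.yval) == (x, y):
        if root.llink is None and root.rlink is None:
            return None                    # leaf: the parent's link becomes NIL
        if root.rlink is None:             # Tr empty: move Tl to the right
            root.rlink, root.llink = root.llink, None
        r = find_min(root.rlink, dim, level + 1)
        root.info, root.xval, root.yval = r.info, r.xval, r.yval
        root.rlink = delete(root.rlink, r.xval, r.yval, level + 1)
    elif (x, y)[dim] < key(root, dim):
        root.llink = delete(root.llink, x, y, level + 1)
    else:
        root.rlink = delete(root.rlink, x, y, level + 1)
    return root

root = None
for name, x, y in [("Banja Luka", 19, 45), ("Derventa", 40, 50),
                   ("Toslic", 38, 38), ("Tuzla", 54, 35), ("Sinj", 4, 4)]:
    root = insert(root, name, x, y)
root = delete(root, 19, 45)   # delete the root; Toslic has the smallest XVAL in Tr
```

Note that `find_min` has to look in both subtrees at levels that split on the other coordinate, which is why finding a replacement is more costly than in an ordinary binary search tree.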


Range queries in 2-d trees

• A range query with respect to a 2-d tree T is a query that specifies a point (xc,yc) and a distance r
• The answer is the set of all points (x,y) in T such that (x,y) lies within distance r of (xc,yc)
• I.e., a range query defines a circle of radius r centered at (xc,yc), and expects to find all points in the 2-d tree that lie within the circle
• Recall that each node N in T implicitly represents a region R_N
• If the circle specified in a query has no intersection with R_N, then there is no point in searching the subtree rooted at N


Example of the Range query

(xc,yc) = (35,46), r = 9.5
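
A possible way to code this range query is sketched below: the region of each node is carried down the recursion as a bounding box, and a subtree is skipped when the query circle does not touch its box. The helper names, the closed-box approximation of each region, and the assumption that the cities were inserted in their listed order are all illustrative.

```python
import math
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    info: Any
    xval: float
    yval: float
    llink: Optional["Node"] = None
    rlink: Optional["Node"] = None

def insert(root, info, x, y, level=0):
    if root is None:
        return Node(info, x, y)
    go_left = x < root.xval if level % 2 == 0 else y < root.yval
    if go_left:
        root.llink = insert(root.llink, info, x, y, level + 1)
    else:
        root.rlink = insert(root.rlink, info, x, y, level + 1)
    return root

def circle_hits_box(xc, yc, r, xlo, xhi, ylo, yhi):
    """True if the circle touches the box; bounds may be infinite."""
    dx = max(xlo - xc, 0.0, xc - xhi)
    dy = max(ylo - yc, 0.0, yc - yhi)
    return dx * dx + dy * dy <= r * r

def range_query(node, xc, yc, r, level=0,
                box=(-math.inf, math.inf, -math.inf, math.inf), out=None):
    out = [] if out is None else out
    # Prune: the region of this node does not intersect the circle.
    if node is None or not circle_hits_box(xc, yc, r, *box):
        return out
    if math.hypot(node.xval - xc, node.yval - yc) <= r:
        out.append(node.info)
    xlo, xhi, ylo, yhi = box
    if level % 2 == 0:   # node splits its region on XVAL
        left, right = (xlo, node.xval, ylo, yhi), (node.xval, xhi, ylo, yhi)
    else:                # node splits its region on YVAL
        left, right = (xlo, xhi, ylo, node.yval), (xlo, xhi, node.yval, yhi)
    range_query(node.llink, xc, yc, r, level + 1, left, out)
    range_query(node.rlink, xc, yc, r, level + 1, right, out)
    return out

root = None
for name, x, y in [("Banja Luka", 19, 45), ("Derventa", 40, 50),
                   ("Toslic", 38, 38), ("Tuzla", 54, 35), ("Sinj", 4, 4)]:
    root = insert(root, name, x, y)
hits = range_query(root, 35, 46, 9.5)
```

With this insertion order, the whole region to the left of Banja Luka (x < 19) lies more than 9.5 away from the query center, so that subtree is never visited.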


K-d trees

• It is a binary tree

• Every node contains a data record, a left pointer and a right pointer

• At every level of the tree, a different attribute of the data is used as the discriminator, in round-robin fashion

• All algorithms for 2-d trees generalize in the obvious way to k-d trees
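
To illustrate the round-robin discriminator, here is a small sketch of insertion for arbitrary k, where the attribute compared at a node is `level % k`. The class and function names are hypothetical, and points are assumed to be tuples of equal length.

```python
class KDNode:
    def __init__(self, point, info=None):
        self.point = tuple(point)
        self.info = info
        self.left = None
        self.right = None

def kd_insert(root, point, info=None, level=0):
    """Insert a k-dimensional point; k is taken from len(point)."""
    point = tuple(point)
    if root is None:
        return KDNode(point, info)
    if root.point == point:
        root.info = info                  # same point: overwrite
        return root
    d = level % len(point)                # discriminator cycles through the k attributes
    if point[d] < root.point[d]:
        root.left = kd_insert(root.left, point, info, level + 1)
    else:
        root.right = kd_insert(root.right, point, info, level + 1)
    return root

# A 3-d example (k = 3) with made-up points.
root = None
for p in [(5, 5, 5), (3, 9, 1), (4, 2, 8), (4, 1, 9)]:
    root = kd_insert(root, p)
```

With k = 2 this reduces to the even/odd rule used for 2-d trees.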


Point Quadtrees

• Point quadtrees always split regions into 4 parts
• In a 2-d tree, node N splits a region into two by drawing one line through the point (N.XVAL,N.YVAL)
• In a point quadtree, node N splits the region it represents by drawing both a horizontal and a vertical line through the point (N.XVAL,N.YVAL)
• These 4 parts are called the NW, SW, NE, SE quadrants determined by node N; each of these corresponds to a child of N
• Node Structure

    qtnodetype = record
        INFO: infotype;
        XVAL: real;
        YVAL: real;
        NW,SW,NE,SE: qtnodetype;
    end
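
A minimal sketch of point quadtree insertion based on this node structure follows. The slides do not say which quadrant receives a point lying exactly on a dividing line, so the convention used here (such points go east and north) is an assumption, as are the names `QTNode`, `quadrant` and `qt_insert` and the insertion order of the sample cities.

```python
class QTNode:
    def __init__(self, info, x, y):
        self.info, self.xval, self.yval = info, x, y
        self.children = {"NW": None, "SW": None, "NE": None, "SE": None}

def quadrant(node, x, y):
    """Quadrant of node that contains (x, y); ties go east/north (assumed)."""
    ns = "S" if y < node.yval else "N"
    ew = "W" if x < node.xval else "E"
    return ns + ew

def qt_insert(root, info, x, y):
    """Insert (x, y) and return the root of the tree."""
    if root is None:
        return QTNode(info, x, y)
    node = root
    while True:
        if (node.xval, node.yval) == (x, y):
            node.info = info              # same coordinates: overwrite
            return root
        q = quadrant(node, x, y)
        if node.children[q] is None:
            node.children[q] = QTNode(info, x, y)
            return root
        node = node.children[q]

root = None
for name, x, y in [("Banja Luka", 19, 45), ("Derventa", 40, 50),
                   ("Toslic", 38, 38), ("Tuzla", 54, 35), ("Sinj", 4, 4)]:
    root = qt_insert(root, name, x, y)
```

Each step compares both coordinates at once, so one level of the quadtree does the work of two levels of a 2-d tree.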


Insertion into Point Quadtrees

City          (XVAL,YVAL)
Banja Luka    (19,45)
Derventa      (40,50)
Toslic        (38,38)
Tuzla         (54,35)
Sinj          (4,4)



Deletion in Point Quadtrees

• If the node N being deleted is a leaf node, deletion is trivial: we just set the appropriate link field of N's parent to NIL and return N to storage

• Otherwise, as in the case of 2-d trees, we need to find an appropriate replacement node


Expanded Node Type

• Expand the node structure to the following

    qtnodetype = record
        INFO: infotype;
        XVAL,YVAL: real;
        XLB,YLB,XUB,YUB: real ∪ {-∞,+∞};
        NW,SW,NE,SE: qtnodetype;
    end

• When inserting a node N into T we need to ensure that
  – If N is the root node, then N.XLB = -∞, N.YLB = -∞, N.XUB = +∞, N.YUB = +∞
  – If P is the parent of N (assume w = P.XUB - P.XLB and h = P.YUB - P.YLB), then

    Case      N.XLB         N.XUB         N.YLB         N.YUB
    N=P.NW    P.XLB         P.XLB+w*.5    P.YLB+h*.5    P.YUB
    N=P.SW    P.XLB         P.XLB+w*.5    P.YLB         P.YLB+h*.5
    N=P.NE    P.XLB+w*.5    P.XUB         P.YLB+h*.5    P.YUB
    N=P.SE    P.XLB+w*.5    P.XUB         P.YLB         P.YLB+h*.5


Deletion in Point Quadtrees

• When deleting an interior node N, we must find a replacement node R in one of the subtrees of N such that
  – every other node R1 in N.NW is to the northwest of R
  – every other node R2 in N.SW is to the southwest of R
  – every other node R3 in N.NE is to the northeast of R
  – every other node R4 in N.SE is to the southeast of R

• In general, it may not always be possible to find such a replacement node
  – deletion of an interior node N may require reinsertion of all nodes in the subtrees of N
  – in the worst case, this may require almost all nodes to be reinserted


Range Searches in Point Quadtree

• Similar to that of 2-d trees

• each node in a point quadtree represents a region

• do not search regions that do not intersect the circle defined by the query


The MX-Quadtree

• For both 2-d trees and point quadtrees,
  – the shape of the tree depends on the order in which objects are inserted
  – the split (into 2 for 2-d trees and 4 for point quadtrees) may be uneven depending on exactly where the point (N.XVAL,N.YVAL) is located inside the region represented by N

• MX-quadtrees (MX stands for matrix) attempt to
  – ensure that the shape (and height) of the tree are independent of the number of nodes present in the tree as well as the order of insertion of these nodes
  – provide efficient deletion and search algorithms


The MX-Quadtree

• The map being represented is split up into a grid of size (2^k × 2^k) for some k
• The application developer is free to choose k to reflect the desired granularity, but once chosen, it must be kept fixed
• Node structure: exactly the same as that of point quadtrees, except that the root represents the region specified by XLB=0, XUB=2^k, YLB=0, YUB=2^k
• When a region gets split, it gets split in the middle
  – the regions represented by the four children of N (w denotes the width of the region represented by N):

    Child    XLB          XUB          YLB          YUB
    NW       N.XLB        N.XLB+w/2    N.YLB+w/2    N.YLB+w
    SW       N.XLB        N.XLB+w/2    N.YLB        N.YLB+w/2
    NE       N.XLB+w/2    N.XLB+w      N.YLB+w/2    N.YLB+w
    SE       N.XLB+w/2    N.XLB+w      N.YLB        N.YLB+w/2
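
The child-region table translates directly into a small function. The sketch below returns each child's bounds as a tuple (XLB, XUB, YLB, YUB); the function name and the tuple layout are illustrative choices.

```python
def child_regions(xlb, ylb, w):
    """Bounds (XLB, XUB, YLB, YUB) of the four children of a square
    region with lower-left corner (xlb, ylb) and width w."""
    half = w / 2
    return {
        "NW": (xlb,        xlb + half, ylb + half, ylb + w),
        "SW": (xlb,        xlb + half, ylb,        ylb + half),
        "NE": (xlb + half, xlb + w,    ylb + half, ylb + w),
        "SE": (xlb + half, xlb + w,    ylb,        ylb + half),
    }

# Root of an MX-quadtree with k = 2, i.e. a 4 x 4 grid.
regions = child_regions(0, 0, 4)
```

Since every split is at the midpoint, the regions depend only on k and on the position in the tree, never on the data.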


Insertion into MX-Quadtree


Deletion in MX-Quadtrees

• It is fairly simple because all points are represented at the leaf level

• To delete N,
  – First set the appropriate link of N's parent (M) to NIL
  – Check if all four link fields of M are NIL
  – If so, examine M's parent (P), find the link field P.dir1 = M, set P.dir1 = NIL, and see if P's four link fields are NIL
  – If so, continue this process
  – Complexity of deletion is O(k)


Range Queries in MX-Quadtrees

• Handled exactly the same way as for point quadtrees, but there are 2 differences
  – the contents of the XLB, XUB, YLB, YUB fields are different
  – as points are stored at the leaf level, checking whether a point is in the circle defined by the range query needs to be performed only at the leaf level


R-Trees

• R-trees are used to store rectangular regions of an image or map
• R-trees are particularly useful in storing very large amounts of data on disk
• They provide a convenient way of minimizing the number of disk accesses
• Each R-tree has an associated order, K
• Each non-leaf node contains at most K rectangles and at least K/2 rectangles, except the root (i.e., each non-root node must be at least half full)
• This makes R-trees appropriate for disk-based retrieval (because each disk access brings back at least K/2 rectangles)


R-Trees

Manipulate two kinds of rectangles: “real” and “group”


R-Tree

• Node Structure

    rtnodetype = record
        Rec1, …, RecK: rectangle;
        P1, …, PK: rtnodetype;
    end
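
To make the node structure concrete, here is a hedged sketch of an R-tree node and a search for all real rectangles that overlap a query rectangle. The tree is built by hand for illustration; insertion and node splitting are not shown. The rectangle layout (xlb, xub, ylb, yub), the use of `None` as the child pointer of a real rectangle, and all names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]      # (xlb, xub, ylb, yub)

@dataclass
class RTNode:
    rects: List[Rect] = field(default_factory=list)
    # children[i] is the subtree under group rectangle rects[i];
    # None marks rects[i] as a "real" rectangle.
    children: List[Optional["RTNode"]] = field(default_factory=list)

def overlaps(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

def rt_search(node, query, out=None):
    """Collect every real rectangle that overlaps the query rectangle."""
    out = [] if out is None else out
    for rect, child in zip(node.rects, node.children):
        if overlaps(rect, query):
            if child is None:
                out.append(rect)
            else:
                rt_search(child, query, out)   # descend into the group
    return out

# Two leaves whose group (bounding) rectangles overlap each other.
leaf1 = RTNode([(0, 2, 0, 2), (3, 5, 1, 4)], [None, None])
leaf2 = RTNode([(4, 7, 3, 6), (8, 9, 8, 9)], [None, None])
root = RTNode([(0, 5, 0, 4), (4, 9, 3, 9)], [leaf1, leaf2])
hits = rt_search(root, (4.5, 4.8, 3.2, 3.8))
```

In this example the query falls inside both group rectangles, so the search has to descend into both leaves.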


Insertion into an R-tree


An incorrect insertion into an R-tree


Deletion in R-Trees

• Deletion may cause a node to "underflow", because each non-root node of an R-tree must contain at least K/2 rectangles (real or group) (recall B+-trees)
• When we delete a rectangle, we must make sure that the node is not left underfull


Comparison of the four data structures

• k-d trees
  – easy to implement
  – a k-d tree with k nodes may have height k
  – since the trees are binary, search and insertion are expensive

• point quadtrees
  – easy to implement
  – each comparison requires comparing 2 attributes
  – deletion is difficult
  – complexity of range queries is O(2n), where n is the number of records in the tree


Comparison of the four data structures

• MX-quadtrees
  – height is at most O(n), where the region represented is composed of (2^n × 2^n) cells
  – insertion, deletion, search: O(n)
  – range search is very efficient: O(N + 2^h), where N is the number of points in the answer to the query and h is the height of the tree

• R-trees
  – insertion, deletion, search: same as MX-quadtrees
  – since a large number of rectangles are stored in each node, they are appropriate for disk accesses
  – the bounding rectangles may overlap (i.e., we may have to search via multiple paths)


Other multidimensional data structures

• K-d-B trees

• hB-tree

• PMR-tree

• R*-tree

• ….


Commercial Systems

• Informix's MapInfo Geocoding datablade: allows assignment of latitudinal and longitudinal elements to records
• Informix's Spatial datablade: employs R-trees
• Oracle Universal Server provides a spatial data option and is based on quadtree technology
• Intergraph's Land Information System allows integration of survey data, imagery, etc., and allows creation of temporal/historical views of land use
• ESRI provides the ARC/INFO system
  – the spatial database engine works with geographic data stored in Oracle, Informix, Sybase, Microsoft SQL Server, and DB2
  – it would be interesting to see what data structures they employ