Chapter 5
Multidimensional Indexes
A one-dimensional index can support a multidimensional query by concatenating the field values into a single search key.
Example: F1 = 'abcd' and F2 = 123 give the combined key 'abcd#123'.
Applications Needing Multiple Dimensions
• Geographic Information Systems
• Data Cubes
Geographic Information Systems
In GIS, data are stored in a two-dimensional space such as a map.
[Figure: a map showing a school, Road1, Road2, House1, House2, and a pipeline]
Typical Queries of GIS
• Partial match queries
• Range queries
• Near-neighbor queries
• Where-am-I queries
Data Cubes
Data with multiple properties can be seen as existing in a high-dimensional space.
Multidimensional data is gathered by many corporations for decision-support applications.
An Example of Data Cube
A chain store may record each sale made, including:
• The day and time
• The store at which the sale was made
• The item purchased
• The color of the item
• The size of the item
• Other properties
Example query: give the sales of pink shirts for each store and each month of 1998.
Multidimensional Queries in SQL
Multidimensional data can be stored in a conventional relational database and queried in SQL.
Finding the nearest points to (10.0, 20.0)
Store points in the relation Points (x, y) with x and y representing the x- and y-coordinates
SELECT *
FROM POINTS p
WHERE NOT EXISTS(
SELECT *
FROM POINTS q
WHERE (q.x-10)*(q.x-10)+(q.y-20)*(q.y-20)<
(p.x-10)*(p.x-10)+(p.y-20)*(p.y-20)
);
Finding the rectangles that contain (10.0, 20.0)
rectangles ( ID,xll,yll,xur,yur)
SELECT id
FROM rectangles
WHERE xll<=10 AND yll<=20 AND
xur>=10 AND yur>=20;
Summarizing the sales of pink shirts
Sales ( day , store , item , color , size )
SELECT day, store, count(*) AS totalSales
FROM sales
WHERE item='shirt' AND color='pink'
GROUP BY day, store;
Executing Range Queries Using Conventional Indexes
• Given ranges in all dimensions, suppose we build a secondary B+-tree index for each dimension.
• Using the B+-tree for each dimension, we get pointers to all of the records in the range for that dimension.
• We intersect these sets of pointers to get the final range query result.
The disk I/O for a range query includes:
• finding the way down the B-trees
• examining the leaf nodes of each B-tree
• retrieving all the matching records
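Range query asking for the points in the square of side 100 surrounding the center of the space
[Figure: a 1000 × 1000 space; the strip of width 100 in each dimension holds about 100,000 points, and the central square where the two strips cross holds about 10,000 points]
• Suppose a B-tree leaf node holds 200 key-pointer pairs and a data block holds 100 records.
• We must access the 100,000 pointers in each dimension.
• Disk I/O: 2 × (100,000/200 + 1) + number of data blocks containing the desired points (at worst 10,000)
• The indexes are of little help: in the worst case we look at every block of the data file.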
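The cost formula above can be evaluated directly. The following is a minimal sketch using only the figures from the example; the variable names are illustrative, and reading the "+1" as one interior-node access per B-tree is an assumption.

```python
# Figures taken from the example above; variable names are illustrative.
pointers_per_dimension = 100_000   # pointers in the range for x (and for y)
pairs_per_leaf = 200               # key-pointer pairs per B-tree leaf
matching_points = 10_000           # points inside the 100 x 100 square

# Leaves scanned in one B-tree, plus 1 (assumed: one interior-node access).
index_io_per_dimension = pointers_per_dimension // pairs_per_leaf + 1

# Worst case: every matching point lies on a different data block.
worst_case_data_io = matching_points

total_io = 2 * index_io_per_dimension + worst_case_data_io
print(index_io_per_dimension, total_io)  # 501 11002
```

With 100 records per block, a file of 10,000 blocks is read in full in the worst case, so the two indexes save nothing over a scan.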
Executing Nearest-Neighbor Queries Using Conventional Indexes
1. Pick a range in each dimension.
2. Ask the range query.
3. Select the point closest to the target within that range.
Two things that could go wrong:
• There are no points within distance d of the given point: repeat the entire process with a higher value of d.
• The closest point found in the range is at distance d’ > d from the target, so a closer point may lie outside the range: repeat the search with d’ in place of d.
[Figure: a target, the closest point in the range, and a possible closer point outside the range]
Disk I/O to find the nearest neighbor to (10.0, 20.0)
• Pick d = 1
• Examine the B-tree for the x-coordinate with the range query 9.0 <= x <= 11.0 (that is, 10.0-d <= x <= 10.0+d)
• Get about 2,000 points
• Traverse at least 10 leaves, most likely 11
• One disk I/O for an intermediate node
• Another 12 disk I/O’s for the y-coordinate
• One more disk I/O to retrieve the desired record
• A total of 25 disk I/O’s, significantly more than a lookup in a single index needs
Multidimensional Index Structures
1. Hash-table-like approaches
   (1) Grid Files
   (2) Partitioned Hash Functions
2. Tree-like approaches
   (1) Multiple-Key Indexes
   (2) kd-Trees
   (3) Quad Trees
   (4) R-Trees
3. Bitmap Indexes
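Grid Index
[Figure: a grid with the Key 1 values V1, V2, ..., Vn along one axis and the Key 2 values X1, X2, ..., Xm along the other; the cell for (V3, X2) leads to the records with key1=V3, key2=X2]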
Customers who bought gold jewelry:
[Figure: scatter plot of Age (0 to 100, grid lines at 40 and 55) against Salary (0 to 500K, grid lines at 90K and 225K)]
Points (age, salary in $K): (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)
• How is a grid index stored on disk?
Like an array: the entries for X1, X2, X3, X4 are stored consecutively for V1, then for V2, then for V3.
Problem:
• Need regularity so we can compute the position of the <Vi,Xj> entry
Solution: Use Indirection
[Figure: a grid with rows V1 to V4 and columns X1 to X3 whose cells point to separate buckets]
* The grid contains only pointers to buckets.
The grid file representing the database of customers
Salary \ Age   0-40                40-55               55+
225+           (30,260) (25,400)   (45,350) (50,275)   (60,260)
90-225         (empty)             (50,100) (50,120)   (70,110) (85,140)
0-90           (25,60)             (45,60) (50,75)     (empty)
Lookup in a Grid File
The positions of the point in each of the dimensions together determine the bucket.
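As a minimal sketch of this lookup, the code below uses the grid lines and points of the customer example. The function names, the dictionary standing in for the grid of bucket pointers, and the choice to place a value that equals a grid line in the upper stripe are assumptions made for illustration.

```python
from bisect import bisect_right

# Grid lines from the customer example: ages split at 40 and 55,
# salaries (in $K) split at 90 and 225.
AGE_LINES = [40, 55]
SALARY_LINES = [90, 225]

def bucket_of(age, salary):
    """Return the (age stripe, salary stripe) cell that holds the point."""
    # Assumption: a value equal to a grid line belongs to the upper stripe.
    return bisect_right(AGE_LINES, age), bisect_right(SALARY_LINES, salary)

# The grid stores only bucket references; a dict stands in for it here.
grid = {}
points = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
for p in points:
    grid.setdefault(bucket_of(*p), []).append(p)

def lookup(age, salary):
    """Point lookup: locate the bucket, then scan only that bucket."""
    return (age, salary) in grid.get(bucket_of(age, salary), [])

print(bucket_of(50, 120), grid[bucket_of(50, 120)])
```

A partial-match query such as age = 50 would fix the first coordinate of the cell and visit every cell in that column.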
Insertion Into Grid Files
Look up the bucket in which the new record belongs and place the record there. If there is no room, there are two general approaches:
(1) Add overflow blocks to the bucket.
(2) Reorganize the structure by adding or moving the grid lines.
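Insertion of the point (52,200) followed by splitting of buckets
[Figure: the Age (0 to 100) against Salary (0 to 500K) plot after the insertion; age grid lines at 40 and 55, salary grid lines at 90K, 130K, and 225K, the line at 130K being new]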
Performance of Grid Files
• Lookup of Specific Points: a read takes 1 disk I/O; an insertion or deletion takes 2 disk I/O’s (+1 if an overflow block must be created).
• Partial-Match Queries: look at all the buckets in a row or column of the bucket matrix.
• Range Queries: look at all the buckets that cover the range defined by the query.
• Nearest-Neighbor Queries: it is not easy to put an upper bound on how costly the search is.
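Partitioned hash function
Idea: hash each key separately and concatenate the resulting bits to form the bucket address.
[Figure: Key1 is hashed by h1 and Key2 by h2; the two bit strings, for example 010110 and 1110010, are placed side by side]
EX:
h1(toy) = 0      h2(10k) = 01
h1(sales) = 1    h2(20k) = 11
h1(art) = 1      h2(30k) = 01
                 h2(40k) = 00
Buckets: 000, 001, 010, 011, 100, 101, 110, 111
Insert <Fred,toy,10k>, <Joe,sales,10k>, <Sally,art,30k>
[Figure: <Fred> is placed in bucket 001; <Joe> and <Sally> are placed in bucket 101]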
• Find Emp. with Dept. = Sales and Sal = 40k
[Figure: the bucket array 000 to 111 holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>]
h1(sales) = 1 and h2(40k) = 00, so look only in bucket 100.
• Find Emp. with Sal = 30k
h2(30k) = 01 and the h1 bit is unknown, so look in buckets 001 and 101.
• Find Emp. with Dept. = Sales
h1(sales) = 1 and the h2 bits are unknown, so look in buckets 100, 101, 110, and 111.
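The three lookups can be reproduced with a short sketch. The hash values are copied from the example; the function names and the use of a dictionary for the bucket array are illustrative assumptions, as is the convention that the h1 bit comes first in the bucket address.

```python
from itertools import product

# Hash values copied from the example; h1 yields 1 bit, h2 yields 2 bits.
H1 = {"toy": "0", "sales": "1", "art": "1"}
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}

def bucket(dept, sal):
    """Bucket address = bit of h1(dept) followed by the bits of h2(sal)."""
    return H1[dept] + H2[sal]

def candidate_buckets(dept=None, sal=None):
    """Buckets to inspect when one or both attributes are specified."""
    first = [H1[dept]] if dept is not None else ["0", "1"]
    second = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return sorted(a + b for a, b in product(first, second))

buckets = {}
for name, dept, sal in [("Fred", "toy", "10k"), ("Joe", "sales", "10k"),
                        ("Sally", "art", "30k")]:
    buckets.setdefault(bucket(dept, sal), []).append(name)

print(buckets)
print(candidate_buckets(dept="sales", sal="40k"))
print(candidate_buckets(sal="30k"))
print(candidate_buckets(dept="sales"))
```

Specifying one attribute narrows the search to the buckets that share its bits, which is why partitioned hashing suits partial-match queries.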
Comparison of Grid Files and Partitioned Hashing
• Grid files are good at nearest-neighbor queries or range queries.
• Partitioned hashing is good at partial match queries.
Tree-Like Structure for Multidimensional Data
• Multiple-key indexes
• Kd-trees
• Quad trees
• R-trees
Motivation: Find records where
DEPT = “Toy” AND SAL > 50k
Multi-key Index
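Strategy I:
• Use one index, say on Dept.
• Get all Dept = “Toy” records and check their salary.
[Figure: a single index I1]
Strategy II:
• Use 2 indexes; manipulate pointers.
[Figure: one index selecting Toy and one selecting Sal > 50k; the two pointer sets are intersected]
Strategy III:
• Multiple-key index
One idea: an index I1 on the first attribute whose entries lead to indexes I2, I3, ... on the second attribute.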
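Example
[Figure: a Dept index with entries Art, Sales, and Toy, each leading to a Salary index (entries shown include 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k); the example record Name=Joe, DEPT=Sales, SAL=15k is reached through Sales and then 15k]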
Performance of Multiple-Key Indexes
• Partial-Match Queries: quite efficient when the first attribute is specified
• Range Queries: handled quite well
• Nearest-Neighbor Queries: use the same strategy as for the other index structures
Partial-Match Queries
• If the first attribute is specified, the access is quite efficient.
• If the second attribute is specified, the access is time-consuming.
Range Queries
• Range query on the first attribute to find all of the subindexes
• Search each of these subindexes, using the range specified for the second attribute
• ……
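A minimal sketch of this two-level range search follows. Sorted Python lists stand in for the B-trees, the class and method names are hypothetical, and every record except Joe's is invented for illustration.

```python
from bisect import bisect_left, bisect_right

class MultiKeyIndex:
    """Index on a first attribute whose entries lead to sorted subindexes
    on a second attribute (sorted lists stand in for B-trees)."""

    def __init__(self, records):
        self.sub = {}
        for first, second, payload in records:
            self.sub.setdefault(first, []).append((second, payload))
        for entries in self.sub.values():
            entries.sort()
        self.first_keys = sorted(self.sub)

    def range_query(self, lo1, hi1, lo2, hi2):
        """Payloads with lo1 <= first <= hi1 and lo2 <= second <= hi2."""
        out = []
        i = bisect_left(self.first_keys, lo1)
        j = bisect_right(self.first_keys, hi1)
        for key in self.first_keys[i:j]:        # subindexes in the first range
            entries = self.sub[key]
            seconds = [s for s, _ in entries]
            a = bisect_left(seconds, lo2)
            b = bisect_right(seconds, hi2)
            out.extend(p for _, p in entries[a:b])  # range on second attribute
        return out

# Joe's record is from the example; the other records are hypothetical.
index = MultiKeyIndex([("Sales", 15, "Joe"), ("Toy", 60, "Ann"),
                       ("Toy", 40, "Bob"), ("Art", 55, "Cy")])
print(index.range_query("Toy", "Toy", 51, 10**9))   # DEPT = Toy AND SAL > 50k
```

A query that fixes only the second attribute would have to visit every subindex, which is the time-consuming case noted above.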
Nearest-Neighbor Queries
• Pick a distance d.
• Ask the range query x0-d<=x<=x0+d and y0-d<=y<=y0+d.
• Find the closest point within this range.
• If there are no points within the range, or the distance from (x0,y0) to the closest point is greater than d, increase the range and search again.
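The steps above can be sketched as a loop around a range query. In this sketch a linear scan stands in for the index, the function names are hypothetical, and doubling d when the range is empty is an assumed policy, since the text only says to increase the range.

```python
import math

def range_query(points, x0, y0, d):
    """Stand-in for an index range query over the square of half-side d."""
    return [(x, y) for x, y in points
            if x0 - d <= x <= x0 + d and y0 - d <= y <= y0 + d]

def nearest_neighbor(points, x0, y0, d=1.0):
    """Nearest point to (x0, y0), found by repeated range queries."""
    if d <= 0:
        raise ValueError("d must be positive")
    if not points:
        return None
    while True:
        candidates = range_query(points, x0, y0, d)
        if candidates:
            best = min(candidates, key=lambda p: math.dist(p, (x0, y0)))
            best_d = math.dist(best, (x0, y0))
            if best_d <= d:
                return best   # no point outside the square can be closer
            d = best_d        # a closer point may lie outside: retry with d'
        else:
            d *= 2            # nothing found: widen the range (assumed policy)

print(nearest_neighbor([(0, 0), (13, 24), (10.9, 20.9)], 10.0, 20.0))
```

The second query is needed in this run because the first candidate sits in a corner of the square, at a distance greater than d.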
kd-Trees (k-dimensional search trees)
• A generalization of the binary search tree to multidimensional data.
• Each interior node has an associated attribute A and a dividing value V.
• The attributes rotate from level to level of the tree.
• Leaves are blocks holding data records.
A kd-tree example
[Figure: a kd-tree with interior nodes Salary 150 (root), Age 60, Age 47, Salary 80, Age 38, and Salary 300; the leaves hold the twelve customer points (age, salary)]
Tree after insertion of (35,500)
[Figure: the same kd-tree; the leaf that held (25,400) and (45,350) is split by a new interior node Age 35 into a leaf with (25,400) and (35,500) and a leaf with (45,350)]
Complex Queries on kd-tree
• Partial-Match Queries
ask for all points with age = 50
• Range Queries
ask for all points with ages 35 to 55 and salaries $100K to $200K
• Nearest-Neighbor Queries
use the same approach as discussed before
Partial-Match Queries ( ask for all points with age = 50)
• Explore both ways at the level with the unknown attribute.
• Go one way at the level with the specified attribute.
Range Queries (ask for all points with ages 35 to 55 and salaries $100K to $200K)
• If the range straddles the splitting value, explore the two children
• Otherwise, move to only one child.
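The following is a minimal sketch of a kd-tree range search over the customer points. The tree is built by median splits with a leaf capacity of 2, so its shape is not claimed to match the example figure; the function names and the dictionary node layout are assumptions.

```python
def build(points, depth=0, leaf_size=2):
    """Build a kd-tree; attributes alternate by level (0 = age, 1 = salary)."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    value = pts[len(pts) // 2][axis]            # dividing value V
    left = [p for p in pts if p[axis] < value]
    right = [p for p in pts if p[axis] >= value]
    if not left:                                # no value below V: keep as leaf
        return {"leaf": pts}
    return {"axis": axis, "value": value,
            "left": build(left, depth + 1, leaf_size),
            "right": build(right, depth + 1, leaf_size)}

def range_query(node, lo, hi):
    """Points p with lo[i] <= p[i] <= hi[i] on both attributes."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(lo[i] <= p[i] <= hi[i] for i in range(2))]
    axis, value = node["axis"], node["value"]
    found = []
    if lo[axis] < value:        # range reaches below V: visit the left child
        found += range_query(node["left"], lo, hi)
    if hi[axis] >= value:       # range reaches V or above: visit the right child
        found += range_query(node["right"], lo, hi)
    return found

points = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = build(points)
print(sorted(range_query(tree, (35, 100), (55, 200))))
```

A partial-match query such as age = 50 is the same search with an unbounded salary range, so both children are explored at every salary level.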
Nearest-Neighbor Queries
• Treat them as range queries
• Repeat with a larger range if necessary
Problem:
(1) Long paths: log2 n for a kd-tree with n leaves.
(2) Unused space: interior nodes hold little information.
Two approaches to improve:
• Multiway Branches at Interior Nodes
• Group Interior Nodes Into Blocks
Multiway Branches at Interior Nodes
• Interior nodes with many key-pointer pairs
• Keeping distribution and balance as we do for B-tree
Group Interior Nodes Into Blocks
• Packing many interior nodes into a single block.
• Including in one block a node and its descendants for some number of levels
Quad Trees
• Data points are contained in a square region.
• If data points in a square can fit in a block, the square will be a leaf of the tree.
• Otherwise, the square will be an interior node, with children corresponding to its four quadrants.
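The recursive rule above can be sketched as follows. The block capacity of 2, the function names, and the bounds 0 to 100 for age and 0 to 500 for salary are assumptions; the region is a rectangle in the raw units because the two axes use different scales, and the points are assumed to be distinct.

```python
def build_quadtree(points, x0, y0, width, height, capacity=2):
    """Region [x0, x0+width) x [y0, y0+height); a leaf if the points fit."""
    if len(points) <= capacity:
        return {"leaf": points}
    w, h = width / 2, height / 2
    children = []
    # The four quadrants: SW, SE, NW, NE.
    for cx, cy in ((x0, y0), (x0 + w, y0), (x0, y0 + h), (x0 + w, y0 + h)):
        inside = [p for p in points
                  if cx <= p[0] < cx + w and cy <= p[1] < cy + h]
        children.append(build_quadtree(inside, cx, cy, w, h, capacity))
    return {"children": children}

def leaves(node):
    """Yield the point list of every leaf."""
    if "leaf" in node:
        yield node["leaf"]
    else:
        for child in node["children"]:
            yield from leaves(child)

points = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = build_quadtree(points, 0, 0, 100, 500)
print([leaf for leaf in leaves(tree) if leaf])
```

Unlike a kd-tree, the split positions are fixed by the geometry rather than by the data, so dense areas produce deeper subtrees.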
Data organized in a quad tree
[Figure: the customer points plotted with Age (0 to 100) against Salary (0 to 400k), the region recursively divided into quadrants]
A quad tree
R-Trees (Region Trees)
• An R-tree node represents a data region which has subregions as its children.
• The data region can be of any shape.
• The subregions need not cover the entire region.
• The subregions are allowed to overlap.
The region of an R-tree node and subregions of its children
“Where-am-I” Query
• Start at the root.
• Examine the subregions at the root to see whether they contain point P
• If no subregion contains P, then P is not in any data region.
• If at least one subregion contains P, recursively search each such subregion until reaching the leaves.
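A minimal sketch of this search is shown below. The node layout and function names are hypothetical. The two root regions reuse the rectangles ((0,0),(60,50)) and ((20,20),(100,80)) that appear in the insertion example, but the coordinates of the individual data regions and their assignment to leaves are invented for illustration.

```python
def contains(rect, p):
    """True if point p lies inside the rectangle ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = rect
    return x1 <= p[0] <= x2 and y1 <= p[1] <= y2

def where_am_i(node, p):
    """Return the names of all data regions that contain point p."""
    if node["leaf"]:
        return [name for rect, name in node["entries"] if contains(rect, p)]
    hits = []
    for rect, child in node["entries"]:
        if contains(rect, p):          # subregions may overlap: follow each one
            hits += where_am_i(child, p)
    return hits

# Hypothetical data-region coordinates, chosen to fit inside the root regions.
left = {"leaf": True, "entries": [(((0, 40), (60, 50)), "road1"),
                                  (((40, 0), (50, 50)), "road2"),
                                  (((10, 10), (20, 20)), "house1")]}
right = {"leaf": True, "entries": [(((25, 25), (40, 35)), "school"),
                                   (((70, 30), (80, 40)), "house2"),
                                   (((70, 60), (100, 62)), "pipeline"),
                                   (((50, 65), (60, 75)), "pop")]}
root = {"leaf": False, "entries": [(((0, 0), (60, 50)), left),
                                   (((20, 20), (100, 80)), right)]}

print(where_am_i(root, (30, 30)), where_am_i(root, (45, 45)))
```

The point (30, 30) lies in both root regions, so both leaves are searched even though only one of them holds a matching data region.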
Insert a new region
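[Figure: the map with the school, Road1, Road2, House1, House2, and the pipeline, plus a new region pop]
Suppose that leaves have room for six regions.
[Figure: an R-tree whose root holds the regions ((0,0),(60,50)) and ((20,20),(100,80)); the leaves beneath hold Road1, Road2, House1, School, House2, pipeline, and pop]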
Expand a region
[Figure: the same map with a further new region House3]
• Expanding the lower subregion to include the new region increases its area by 1000 units.
• Expanding the upper subregion increases its area by 1200 units.
Bitmap Indexes
1. A bitmap index for a field F is a collection of bit-vectors of length n (n: number of records).
2. One bit-vector corresponds to each possible value that may appear in the field F.
3. The vector for value v has 1 in position i if the ith record has v in field F, and it has 0 there if not.
An Example of a Bitmap Index
Suppose a file has six records with two fields f and g: (30, foo), (30, bar), (40, baz), (50, foo), (40, bar), (30, baz)
f:  30: 110001    40: 001010    50: 000100
g:  foo: 100100   bar: 010010   baz: 001001
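The bit-vectors in this example can be derived mechanically. The sketch below is one way to do so; the function name is hypothetical and the vectors are kept as strings of 0 and 1 purely for readability.

```python
def build_bitmap_index(values):
    """Map each distinct value to a bit-vector (a string of '0' and '1')."""
    index = {}
    for i, v in enumerate(values):
        # Create an all-zero vector on first sight, then set position i.
        index.setdefault(v, ["0"] * len(values))[i] = "1"
    return {v: "".join(bits) for v, bits in index.items()}

records = [(30, "foo"), (30, "bar"), (40, "baz"),
           (50, "foo"), (40, "bar"), (30, "baz")]
f_index = build_bitmap_index([f for f, _ in records])
g_index = build_bitmap_index([g for _, g in records])
print(f_index)
print(g_index)
```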
Partial-match queries by bitmap indexes
movie ( title,year,length,studioname)
SELECT title FROM movie WHERE studioname = 'Disney' AND year = 1995;
Take the bitwise AND of the bit-vector for year = 1995 and the bit-vector for studioname = 'Disney'.
Range queries by bitmap indexes
Records: 1:(25,60) 2:(45,60) 3:(50,75) 4:(50,100) 5:(50,120) 6:(70,140) 7:(85,140) 8:(45,350)
Find all records with an age in the range 45 - 55 and a salary in the range 100 - 200, using bitmap indexes as follows.
Age:    25: 10000000   45: 01000001   50: 00111000   70: 00000100   85: 00000010
Salary: 60: 11000000   75: 00100000   100: 00010000   120: 00001000   140: 00000110   350: 00000001
Ages 45 and 50: 01000001 OR 00111000 = 01111001
Salaries 100, 120, and 140: 00010000 OR 00001000 OR 00000110 = 00011110
01111001 AND 00011110 = 00011000
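The same computation can be sketched with integers as bit-vectors. The records are those of the example; the helper names and the convention that record 1 maps to the leftmost bit are assumptions made to match the printed vectors.

```python
records = [(25, 60), (45, 60), (50, 75), (50, 100),
           (50, 120), (70, 140), (85, 140), (45, 350)]
n = len(records)

def bitmap(field, value):
    """Bit-vector as an int; record 1 maps to the leftmost (highest) bit."""
    bits = 0
    for i, rec in enumerate(records):
        if rec[field] == value:
            bits |= 1 << (n - 1 - i)
    return bits

def range_vector(field, lo, hi):
    """OR together the vectors of every value that falls in [lo, hi]."""
    result = 0
    for v in {rec[field] for rec in records if lo <= rec[field] <= hi}:
        result |= bitmap(field, v)
    return result

ages = range_vector(0, 45, 55)
salaries = range_vector(1, 100, 200)
answer = ages & salaries
matches = [i + 1 for i in range(n) if (answer >> (n - 1 - i)) & 1]
print(format(ages, "08b"), format(salaries, "08b"), format(answer, "08b"))
print(matches)
```

The two 1 bits in the result identify records 4 and 5, that is, (50,100) and (50,120).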
Compressed Bitmaps
• Run-length encoding (run: a sequence of i 0’s followed by a 1)
• Let j be the number of bits in the binary representation of i (about log2 i). Encode the run as j-1 1’s and a single 0, followed by i in binary.
• Concatenate the codes for each run together.
Examples: i=0: 00; i=1: 01; i=13: 1110 1101
Encode and Decode
• Encode
age 25: 100000001000
runs (0,7): 00 110111
• Decode
11101101001011
runs 13, 0, 3: 0000000000000110001
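A minimal sketch of this encoding and decoding follows. The function names are hypothetical. Trailing 0's after the last 1 are dropped by the encoder, as in the age 25 example, so the decoder restores the vector only up to its last 1.

```python
def encode(bits):
    """Run-length encode a bit-vector string; trailing 0's are dropped."""
    out, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        else:
            binary = format(run, "b")           # i in binary, j bits long
            out.append("1" * (len(binary) - 1) + "0" + binary)
            run = 0
    return "".join(out)

def decode_runs(code):
    """Return the list of run lengths stored in an encoded string."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":                 # unary part: j-1 ones
            j += 1
            pos += 1
        pos += 1                                # skip the terminating 0
        runs.append(int(code[pos:pos + j], 2))  # next j bits hold i
        pos += j
    return runs

def decode(code):
    """Rebuild the bit-vector: each run is i 0's followed by a 1."""
    return "".join("0" * i + "1" for i in decode_runs(code))

print(encode("100000001000"))
print(decode_runs("11101101001011"), decode("11101101001011"))
```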
To perform bitwise AND or OR on encoded bit-vectors
• Decode one run at a time.
• Determine where the next 1 is in each operand bit-vector.
• If OR, produce 1 at that position of the output; if AND, produce 1 if and only if both operands have their next 1 at the same position.
Example: OR of the encoded vectors 25: 00110111 and 30: 110111
              25                     30
First run     0 (1 in position 1)    7 (1 in position 8)
Second run    7 (1 in position 9)
Result        100000011
Managing Bitmap Indexes
• Finding Bit-Vectors
• Finding Records
• Handling Modifications to the Data File
Finding Bit-Vectors
Use any secondary index with the field value as search key, such as B-tree, hash table or indexed-sequential files.
Finding Records
Use a secondary index on the data file, whose search key is the number of the record.
Handling Modifications to the Data file
• Record numbers must remain fixed once assigned
• Changes to the data file require the bitmap index to change as well
Deleting Record i
• Leave a “tombstone” in the data file.
• Change position i of the bit-vector for the record’s value from 1 to 0.
Insert New Record
• Assign the next available record number to the new record.
• Modify the bit-vector for the value of the new record by appending a 1 at the end
• If the value did not appear before, add a new bit-vector for it.
• Insert the new bit-vector and its corresponding value to the secondary index.
Modifying the value of record i from v to w
• Change bit-vector for v in position i from 1 to 0
• Change bit-vector for w in position i from 0 to 1, or create a bit-vector for w if w is a new value.
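The three kinds of modification can be sketched on an uncompressed bitmap index. The class and method names are hypothetical, record numbers start at 0 here, and the tombstone in the data file and the secondary index on values are left out. Because the vectors are stored uncompressed, an insertion also appends an explicit 0 to every other vector.

```python
class BitmapIndex:
    """Bit-vectors kept as lists of 0/1; record numbers are list positions."""

    def __init__(self):
        self.vectors = {}
        self.n = 0            # number of record numbers assigned so far

    def insert(self, value):
        """Assign the next record number and set its bit for `value`."""
        for bits in self.vectors.values():
            bits.append(0)    # uncompressed vectors all grow by one position
        # Existing value: set the new last bit; new value: create its vector.
        self.vectors.setdefault(value, [0] * (self.n + 1))[self.n] = 1
        self.n += 1
        return self.n - 1

    def delete(self, i):
        """Record i is tombstoned: clear its bit; its number is not reused."""
        for bits in self.vectors.values():
            bits[i] = 0

    def modify(self, i, old, new):
        """Move the 1 for record i from the vector of `old` to that of `new`."""
        self.vectors[old][i] = 0
        self.vectors.setdefault(new, [0] * self.n)[i] = 1

index = BitmapIndex()
for v in [30, 30, 40]:
    index.insert(v)
index.modify(1, 30, 50)
print(index.vectors)
```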
Conclusion
• Multidimensional Data
• Grid Files
• Partitioned Hash Tables
• Multiple-Key Indexes
• Kd-Trees
• Quad Trees
• R-Trees
• Bitmap Indexes
Exercises
• Ex 4.1.2, Ex 4.2.6, Ex 4.3.1, Ex 4.4.6
• Ex 5.1.3, Ex 5.2.7, Ex 5.3.2, Ex 5.4.2