
Chapter 5

Multidimensional Indexes

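A one-dimensional index can be used to support multidimensional queries, for example by concatenating the field values into a single search key.

F1 = 'abcd' and F2 = 123 give the search key 'abcd#123'.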

Applications Needing Multiple Dimensions

• Geographic Information Systems

• Data Cubes

Geographic Information Systems

In a GIS, data are stored in a two-dimensional space such as a map.

[Figure: a map containing a school, road1, road2, house1, house2, and a pipeline]

Typical Queries of GIS

• Partial match queries

• Range queries

• Near-neighbor queries

• Where-am-I queries

Data Cubes

Data with multiple properties can be seen as existing in a high-dimensional space.

Multidimensional data is gathered by many corporations for decision-support applications.

An Example of Data Cube

A chain store may record each sale made, including:

• The day and time

• The store at which the sale was made

• The item purchased

• The color of the item

• The size of the item

• The other properties

Example query: give the sales of pink shirts for each store and each month of 1998.

Multidimensional Queries in SQL

Multidimensional data can be stored in a conventional relational database and we can query them in SQL.

Finding the nearest points to (10.0, 20.0)

Store points in the relation Points (x, y) with x and y representing the x- and y-coordinates

SELECT *
FROM Points p
WHERE NOT EXISTS (
    SELECT *
    FROM Points q
    WHERE (q.x-10)*(q.x-10) + (q.y-20)*(q.y-20) <
          (p.x-10)*(p.x-10) + (p.y-20)*(p.y-20)
);

Finding the rectangles that contain (10.0, 20.0)

Rectangles (id, xll, yll, xur, yur), where (xll, yll) is the lower-left corner and (xur, yur) is the upper-right corner

SELECT id
FROM Rectangles
WHERE xll <= 10 AND yll <= 20 AND
      xur >= 10 AND yur >= 20;

Summarizing the sales of pink shirts

Sales (day, store, item, color, size)

SELECT day, store, COUNT(*) AS totalSales
FROM Sales
WHERE item = 'shirt' AND color = 'pink'
GROUP BY day, store;

Executing Range Queries Using Conventional Indexes

• Given ranges in all dimensions, suppose we build a secondary B+ tree index for each dimension.

• Using the B+ tree for each dimension, we can get pointers to all of the records in the range for that dimension.

• We intersect these pointers to get the final range query results.

The disk I/O for range query includes:

• to find the way down the B-Trees

• to examine leaf nodes of each B-tree

• to retrieve all the matching records

Example: a range query asking for the points in the square of side 100 surrounding the center of a 1000 × 1000 space.

Suppose a leaf node holds 200 key-pointer pairs and a block holds 100 records.

• The range in either dimension selects about 100,000 pointers, all of which we must access.

• About 10,000 points lie in the square.

• Disk I/O: 2 × (100,000/200 + 1) = 1002 for the two B-trees, plus the number of data blocks containing the desired points (at worst 10,000).

The indexes are of little help: in the worst case we might as well look at every block of the data file.
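The arithmetic behind this estimate can be checked with a short Python sketch. The function name `index_io` is made up for illustration; the figures (100,000 pointers per dimension, 200 pairs per leaf, at worst 10,000 data blocks) are the ones quoted on the slide.

```python
def index_io(pointers_in_range: int, pairs_per_leaf: int) -> int:
    """Leaf blocks scanned in one B-tree, plus one interior node."""
    return pointers_in_range // pairs_per_leaf + 1

# Two B-trees (x and y), each returning about 100,000 pointers.
index_cost = 2 * index_io(100_000, 200)

# Worst case: every desired point sits in a different data block.
worst_case_total = index_cost + 10_000

print(index_cost, worst_case_total)
```

The index traversal is a small part of the worst-case total; retrieving the matching records dominates.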

Executing Nearest-Neighbor Queries Using Conventional Indexes

1. Pick a range in each dimension.

2. Ask the range query.

3. Select the point closest to the target within that range.

Two things that could go wrong:

• There is no point within distance d of the given point: repeat the entire process with a higher value of d.

• The closest point found in the range is at distance d' > d from the target, so a closer point may lie outside the range: repeat the search with d' in place of d.

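[Figure: the closest point in the range lies in a corner of the search square, farther than d from the target; a possible closer point lies just outside the square]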
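The retry logic can be sketched in Python as follows. This is only an illustration under stated assumptions: `range_query` is a hypothetical stand-in for the two index lookups plus pointer intersection, and doubling d when the range is empty is one arbitrary way to pick "a higher value of d".

```python
import math

def range_query(points, x0, y0, d):
    """Stand-in for the index-based range query over a square of half-side d."""
    return [(x, y) for x, y in points
            if x0 - d <= x <= x0 + d and y0 - d <= y <= y0 + d]

def nearest_neighbor(points, x0, y0, d=1.0):
    if not points:
        return None
    while True:
        candidates = range_query(points, x0, y0, d)
        if not candidates:
            d *= 2              # case 1: nothing in range, try a larger d
            continue
        best = min(candidates, key=lambda p: math.dist(p, (x0, y0)))
        d_best = math.dist(best, (x0, y0))
        if d_best <= d:
            return best         # no point outside the square can be closer
        d = d_best              # case 2: closest is at d' > d, retry with d'

points = [(10.9, 20.9), (10.0, 21.2), (50.0, 50.0)]
print(nearest_neighbor(points, 10.0, 20.0))
```

In the sample data the first range query finds only (10.9, 20.9), at a distance greater than 1, so the search is repeated with that distance and then finds the closer point (10.0, 21.2).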

Disk I/O to find the nearest neighbor to (10.0, 20.0)

• Pick d = 1.

• Examine the B-tree for the x-coordinate with the range query 9 <= x <= 11, that is, 10.0 - d <= x <= 10.0 + d.

• Get about 2,000 points.

• Traverse at least 10 leaves, most likely 11.

• One disk I/O for an intermediate node.

• Another 12 disk I/O's for the y-coordinate.

• One more disk I/O to retrieve the desired record.

• A total of 25 disk I/O's.

Significantly more disk I/O’s

Multidimensional Index Structures

1. Hash-table-like approaches

(1) Grid Files

(2) Partitioned Hash Functions

2. Tree-like approaches

(1) Multiple-Key Indexes

(2) kd-Trees

(3) Quad Trees

(4) R-Trees

3. Bitmap Indexes

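Grid Index

[Figure: a grid whose rows are the values V1, V2, ..., Vn of Key 1 and whose columns are the values X1, X2, ..., Xm of Key 2; each cell leads to the records with that combination, for example the records with key1 = V3 and key2 = X2]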

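Customers who bought gold jewelry:

[Figure: scatter plot of the points below, with age (0 to 100) on the horizontal axis and salary (0 to 500K) on the vertical axis; grid lines are drawn at ages 40 and 55 and at salaries 90K and 225K]

Points (age, salary in $K): (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)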

• How is a grid index stored on disk?

Like an array, one row after another:

V1: X1 X2 X3 X4 | V2: X1 X2 X3 X4 | V3: X1 X2 X3 X4

Problem:

• Need regularity so we can compute the position of the <Vi,Xj> entry

Solution: Use Indirection

• The grid contains only pointers to buckets.

[Figure: a grid with rows V1 to V4 and columns X1 to X3; each cell holds a pointer to one of the buckets]

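The grid file representing the database of customers (age, salary in $K)

Salary    | Age 0-40          | Age 40-55         | Age 55+
225+      | (30,260) (25,400) | (45,350) (50,275) | (60,260)
90-225    | (empty)           | (50,100) (50,120) | (70,110) (85,140)
0-90      | (25,60)           | (45,60) (50,75)   | (empty)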

Lookup in a Grid File

The positions of the point in each of the dimensions together determine the bucket.
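A minimal Python sketch of this lookup, using the customer grid with lines at ages 40 and 55 and salaries 90K and 225K. The names `bucket_of`, `grid`, and `lookup` are illustrative, and the rule that a value lying exactly on a grid line belongs to the stripe above it is an assumption.

```python
from bisect import bisect_right

AGE_LINES = [40, 55]        # age stripes: 0-40, 40-55, 55+
SALARY_LINES = [90, 225]    # salary stripes in $K: 0-90, 90-225, 225+

def bucket_of(age, salary):
    """Return (age stripe, salary stripe) for a point."""
    return bisect_right(AGE_LINES, age), bisect_right(SALARY_LINES, salary)

customers = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

# The grid maps a pair of stripe numbers to a bucket (here, a Python list).
grid = {}
for point in customers:
    grid.setdefault(bucket_of(*point), []).append(point)

def lookup(age, salary):
    """Find the bucket from the stripe positions, then scan only that bucket."""
    bucket = grid.get(bucket_of(age, salary), [])
    return [p for p in bucket if p == (age, salary)]

print(bucket_of(50, 100), grid[(1, 1)])
```

A lookup touches one bucket only, which is why the cost of finding a specific point does not depend on the number of buckets.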

Insertion Into Grid Files

Look up the bucket for the record and place the new record in that bucket. If there is no room, there are two general approaches:

(1) Add overflow blocks to the bucket.

(2) Reorganize the structure by adding or moving the grid lines.

Insertion of the point (52,200) followed by splitting of buckets

[Figure: the customer scatter plot after inserting (52,200); age lines remain at 40 and 55, and a new salary line at 130K is added to the lines at 90K and 225K]

Performance of Grid Files

• Lookup of specific points: 1 disk I/O to read; 2 disk I/O's for insertion or deletion (+1 if an overflow block is created).

• Partial-match queries: look at all the buckets in a row or column of the bucket matrix.

• Range queries: look at all the buckets that cover the range defined by the query.

• Nearest-neighbor queries: it is not easy to put an upper bound on how costly the search is.

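Partitioned hash function

Idea: hash each key with its own hash function and concatenate the resulting bits to form the bucket address.

Key1 → h1 → 010110
Key2 → h2 → 1110010

EX:

h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ...
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ...

Buckets: 000, 001, 010, 011, 100, 101, 110, 111

Insert <Fred,toy,10k>, <Joe,sales,10k>, <Sally,art,30k>

• <Fred> hashes to 0 followed by 01, so it goes in bucket 001.

• <Joe> and <Sally> both hash to 1 followed by 01, so they go in bucket 101.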

• Find Emp. with Dept. = Sales and Sal = 40k

With the same hash functions, h1(sales) = 1 and h2(40k) = 00, so only bucket 100 needs to be examined.

[Figure: the eight buckets, holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, and <Andy>]

• Find Emp. with Sal = 30k

With the same hash functions, h2(30k) = 01 and the first bit is unknown, so look in buckets 001 and 101.

• Find Emp. with Dept. = Sales

With the same hash functions, h1(sales) = 1 and the last two bits are unknown, so look in buckets 100, 101, 110, and 111.
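The bucket arithmetic in this example can be reproduced with a short Python sketch. The hash values are copied from the slide; the dictionary representation and the names `bucket` and `candidate_buckets` are illustrative assumptions.

```python
from itertools import product

H1 = {"toy": "0", "sales": "1", "art": "1"}                 # 1 bit from Dept
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}   # 2 bits from Sal

def bucket(dept, sal):
    """Bucket address: the h1 bit followed by the two h2 bits."""
    return H1[dept] + H2[sal]

def candidate_buckets(dept=None, sal=None):
    """Buckets to examine when one or both attributes are specified."""
    first = [H1[dept]] if dept is not None else ["0", "1"]
    second = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return [a + b for a, b in product(first, second)]

table = {}
for name, dept, sal in [("Fred", "toy", "10k"), ("Joe", "sales", "10k"),
                        ("Sally", "art", "30k")]:
    table.setdefault(bucket(dept, sal), []).append(name)

print(table)
print(candidate_buckets("sales", "40k"))
print(candidate_buckets(sal="30k"))
print(candidate_buckets(dept="sales"))
```

Each unspecified attribute multiplies the number of candidate buckets by the number of bit patterns that attribute could contribute, which is the cost of a partial-match query.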

Comparison of Grid Files and Partitioned Hashing

• Grid files are good at nearest-neighbor queries or range queries.

• Partitioned hashing is good at partial match queries.

Tree-Like Structure for Multidimensional Data

• Multiple-key indexes

• Kd-trees

• Quad trees

• R-trees

Motivation: Find records where

DEPT = “Toy” AND SAL > 50k

Multi-key Index

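Strategy I:

• Use one index, say Dept.

• Get all Dept = "Toy" records and check their salary

[Figure: a single index I1 on Dept]

Strategy II:

• Use 2 indexes; manipulate pointers

[Figure: the pointers found for Dept = "Toy" and the pointers found for Sal > 50k are intersected]

Strategy III:

• Multiple-key index

One idea:

[Figure: an index I1 on the first attribute; each of its entries points to an index (I2, I3, ...) on the second attribute]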

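Example

Example record: Name = Joe, DEPT = Sales, SAL = 15k

[Figure: a Dept index with entries Art, Sales, and Toy; each entry points to a salary index, one with keys 10k, 15k, 17k, 21k and another with keys 12k, 15k, 15k, 19k. The example record is reached by following the Sales entry and then the key 15k in its salary index.]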
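A multiple-key index can be sketched in Python as an index on the first attribute whose entries lead to sorted subindexes on the second. This is only an illustration: a dictionary and sorted lists stand in for the real index structures, and every record other than Joe is hypothetical.

```python
from bisect import bisect_right

# First-attribute index: dept -> (sorted salaries, names in the same order)
index = {}

def insert(name, dept, sal):
    sals, names = index.setdefault(dept, ([], []))
    pos = bisect_right(sals, sal)
    sals.insert(pos, sal)
    names.insert(pos, name)

def dept_and_salary_above(dept, min_sal):
    """Names of records with the given dept and salary > min_sal."""
    sals, names = index.get(dept, ([], []))
    return names[bisect_right(sals, min_sal):]

insert("Joe", "Sales", 15)      # the example record from the slide
insert("Ann", "Toy", 60)        # hypothetical records
insert("Bob", "Toy", 40)
insert("Cid", "Toy", 75)

print(dept_and_salary_above("Toy", 50))
```

The query DEPT = "Toy" AND SAL > 50k follows one entry of the first index and then scans a suffix of a single subindex.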

Performance of Multiple-Key Indexes

• Partial-Match Queries

quite efficient when the first attribute is specified

• Range Queries

work quite well

• Nearest-Neighbor Queries

use the same strategy as for the other index structures

Partial-Match Queries

• If the first attribute is specified, the access is quite efficient.

• If only the second attribute is specified, the access is time-consuming, because every subindex must be searched.

Range Queries

• Range query on the first attribute to find all of the subindexes

• Search each of these subindexes, using the range specified for the second attribute

• ……

Nearest-Neighbor Queries

• Pick a distance d.

• Ask the range query x0-d <= x <= x0+d and y0-d <= y <= y0+d.

• Find the closest point within this range.

• If there are no points within the range, or the distance from (x0,y0) to the closest point is greater than d, increase the range and search again.

kd-Trees (k-dimensional search trees)

• A generalization of the binary search tree to multidimensional data.

• Each interior node has an associated attribute A and a dividing value V.

• The attributes rotate from one level of the tree to the next.

• The leaves are blocks holding data records.

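Salary 150
    salary < 150: Age 60
        age < 60: Salary 80
            salary < 80: Age 38
                age < 38: (25,60)
                age >= 38: (45,60) (50,75)
            salary >= 80: (50,100) (50,120)
        age >= 60: (70,110) (85,140)
    salary >= 150: Age 47
        age < 47: Salary 300
            salary < 300: (30,260)
            salary >= 300: (25,400) (45,350)
        age >= 47: (50,275) (60,260)

A kd-tree example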

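Tree after insertion of (35,500)

Salary 150
    salary < 150: Age 60
        age < 60: Salary 80
            salary < 80: Age 38
                age < 38: (25,60)
                age >= 38: (45,60) (50,75)
            salary >= 80: (50,100) (50,120)
        age >= 60: (70,110) (85,140)
    salary >= 150: Age 47
        age < 47: Salary 300
            salary < 300: (30,260)
            salary >= 300: Age 35
                age < 35: (25,400)
                age >= 35: (35,500) (45,350)
        age >= 47: (50,275) (60,260)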

Complex Queries on kd-tree

• Partial-Match Queries

ask for all points with age = 50

• Range Queries

ask for all points with ages 35 to 55 and salaries $100K to $200K

• Nearest-Neighbor Queries

use the same approach as discussed before

Partial-Match Queries ( ask for all points with age = 50)

• Explore both ways at the level with the unknown attribute.

• Go one way at the level with the specified attribute.

Range Queries (ask for all points with ages 35 to 55 and salaries $100K to $200K)

• If the range straddles the splitting value, explore the two children

• Otherwise, move to only one child.

Nearest-Neighbor Queries

• Treat them as range queries

• Repeat with a larger range if necessary
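The range-query rule can be sketched in Python on the example kd-tree. The nested-tuple representation is an assumption made for brevity, as is the convention that a point equal to the split value goes to the right child. A partial-match query is expressed here as a range that is unbounded in the unspecified attribute.

```python
# Interior node: (attribute, split value, left subtree, right subtree),
# where attribute 0 is age and 1 is salary. A leaf is a list of points.
TREE = (1, 150,
        (0, 60,
         (1, 80,
          (0, 38, [(25, 60)], [(45, 60), (50, 75)]),
          [(50, 100), (50, 120)]),
         [(70, 110), (85, 140)]),
        (0, 47,
         (1, 300, [(30, 260)], [(25, 400), (45, 350)]),
         [(50, 275), (60, 260)]))

def range_query(node, lo, hi):
    """Points p with lo[i] <= p[i] <= hi[i] for both attributes."""
    if isinstance(node, list):
        return [p for p in node
                if all(lo[i] <= p[i] <= hi[i] for i in (0, 1))]
    attr, split, left, right = node
    found = []
    if lo[attr] < split:        # part of the range lies below the split
        found += range_query(left, lo, hi)
    if hi[attr] >= split:       # part of the range lies at or above it
        found += range_query(right, lo, hi)
    return found

INF = float("inf")
print(range_query(TREE, (35, 100), (55, 200)))    # ages 35-55, salaries 100-200
print(range_query(TREE, (50, -INF), (50, INF)))   # partial match: age = 50
```

Both children are visited only when the range straddles the split value, so at levels that split on the unspecified attribute a partial-match query always explores both ways.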

Problems:

(1) Long paths: about log2 n for a kd-tree with n leaves.

(2) Unused space: interior nodes hold little information.

Two approaches to improve:

• Multiway Branches at Interior Nodes

• Group Interior Nodes Into Blocks

Multiway Branches at Interior Nodes

• Interior nodes with many key-pointer pairs

• Keeping distribution and balance as we do for B-tree

Group Interior Nodes Into Blocks

• Packing many interior nodes into a single block.

• Including in one block a node and its descendants for some number of levels

Quad Trees

• Data points are contained in a square region.

• If data points in a square can fit in a block, the square will be a leaf of the tree.

• Otherwise, the square will be an interior node, with children corresponding to its four quadrants.
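These rules amount to a short recursive construction, sketched here in Python. The block capacity of 2 is an assumption, the region is given by a width and a height so that the age and salary axes can have different scales, and the points are assumed to be distinct (otherwise the recursion would not terminate).

```python
BLOCK_CAPACITY = 2   # assumed number of points that fit in one block

def build(points, x0, y0, width, height):
    """Return a leaf (list of points) or an interior node (dict of quadrants)."""
    if len(points) <= BLOCK_CAPACITY:
        return list(points)
    mx, my = x0 + width / 2, y0 + height / 2
    quadrants = {(qx, qy): [] for qx in (0, 1) for qy in (0, 1)}
    for x, y in points:
        # Quadrant (0, 0) is lower-left and (1, 1) is upper-right.
        quadrants[(int(x >= mx), int(y >= my))].append((x, y))
    return {(qx, qy): build(pts, x0 + qx * width / 2, y0 + qy * height / 2,
                            width / 2, height / 2)
            for (qx, qy), pts in quadrants.items()}

def count_points(node):
    if isinstance(node, list):
        return len(node)
    return sum(count_points(child) for child in node.values())

customers = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = build(customers, 0, 0, 100, 400)   # age 0-100, salary 0-400 ($K)
print(tree[(0, 0)], count_points(tree))
```

Unlike a kd-tree, the split positions are fixed by the region rather than by the data, so dense quadrants become deeper subtrees.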

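Data organized in a quad tree

[Figure: the customer points plotted with age from 0 to 100 and salary from 0 to 400K, with the region recursively divided into quadrants]

A quad tree

[Figure: the quad tree corresponding to the division above]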

R-Trees (Region Trees)

• The R-tree node represents a data region which has subregions as its children.

• The data region can be of any shape.

• The subregions do not cover the entire region.

• The subregions are allowed to overlap.

The region of an R-tree node and subregions of its children

“Where-am-I” Query

• Start at the root.

• Examine the subregions at the root to see whether they contain point P

• If no subregion contains P, then P is not in any data region.

• If at least one subregion contains P, recursively search each such subregion for P until reaching the leaves.
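The search can be sketched in Python as follows. The tuple-based node layout is an assumption, and so are the coordinates of the individual data regions; only the two bounding rectangles ((0,0),(60,50)) and ((20,20),(100,80)) come from the slides.

```python
def contains(rect, p):
    (x1, y1), (x2, y2) = rect
    return x1 <= p[0] <= x2 and y1 <= p[1] <= y2

def where_am_i(node, p):
    """Names of all data regions that contain point p.

    node is ("leaf", [(rect, name), ...]) or ("interior", [(rect, child), ...]).
    """
    kind, entries = node
    if kind == "leaf":
        return [name for rect, name in entries if contains(rect, p)]
    found = []
    for rect, child in entries:
        if contains(rect, p):   # several may match: subregions can overlap
            found += where_am_i(child, p)
    return found

# Hypothetical data regions placed inside the two bounding rectangles.
left = ("leaf", [(((0, 40), (60, 50)), "road1"),
                 (((10, 10), (20, 20)), "house1")])
right = ("leaf", [(((60, 50), (80, 70)), "school"),
                  (((30, 25), (40, 35)), "house2")])
root = ("interior", [(((0, 0), (60, 50)), left),
                     (((20, 20), (100, 80)), right)])

print(where_am_i(root, (35, 30)))
```

The point (35, 30) lies in both bounding rectangles, so both leaves are searched even though only one of them holds a region containing it.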

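Insert a new region

[Figure: the map of the school, road1, road2, house1, house2, and the pipeline, with a new region "pop" added]

Suppose that leaves have room for six regions.

[Figure: an R-tree whose root holds the two bounding rectangles ((0,0),(60,50)) and ((20,20),(100,80)); its leaves hold road1, road2, house1, school, house2, pipeline, and pop]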

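Expand a region

[Figure: the same map with a new region "house3" added outside both subregions]

• Expanding the lower subregion to include house3 increases its area by 1000 square units.

• Expanding the upper subregion to include house3 increases its area by 1200 square units.

The lower subregion needs the smaller increase, so it is the better one to expand.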
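The area comparison can be reproduced with a small Python sketch. The two subregions are the bounding rectangles from the R-tree example; the coordinates of house3 are hypothetical, chosen so that the increases come out to the 1000 and 1200 units quoted here.

```python
def area(rect):
    (x1, y1), (x2, y2) = rect
    return (x2 - x1) * (y2 - y1)

def expand(rect, new):
    """Smallest rectangle covering both rect and new."""
    (x1, y1), (x2, y2) = rect
    (a1, b1), (a2, b2) = new
    return ((min(x1, a1), min(y1, b1)), (max(x2, a2), max(y2, b2)))

def enlargement(rect, new):
    return area(expand(rect, new)) - area(rect)

lower = ((0, 0), (60, 50))
upper = ((20, 20), (100, 80))
house3 = ((70, 5), (80, 15))    # hypothetical coordinates

print(enlargement(lower, house3), enlargement(upper, house3))
best = min((lower, upper), key=lambda r: enlargement(r, house3))
```

Choosing the subregion with the smallest enlargement is a common heuristic for keeping regions small; other criteria are possible.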

Bitmap Indexes

1. A bitmap index for a field F is a collection of bit-vectors of length n (n: number of records).

2. One bit-vector corresponds to each possible value that may appear in the field F.

3. The vector for value v has 1 in position i if the ith record has v in field F, and it has 0 there if not.

An Example of a Bitmap Index

Suppose a file has six records with two fields f and g: (30,foo), (30,bar), (40,baz), (50,foo), (40,bar), (30,baz)

f:  30: 110001    40: 001010    50: 000100
g:  foo: 100100   bar: 010010   baz: 001001
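The bit-vectors in this example can be built mechanically, as in the following Python sketch. Representing each vector as a string of '0' and '1' characters is a choice made for readability, not a storage format, and the name `build_bitmap_index` is illustrative.

```python
def build_bitmap_index(values):
    """Map each distinct value to a bit-vector with a 1 at its record positions."""
    n = len(values)
    index = {}
    for i, v in enumerate(values):
        index.setdefault(v, ["0"] * n)[i] = "1"
    return {v: "".join(bits) for v, bits in index.items()}

records = [(30, "foo"), (30, "bar"), (40, "baz"),
           (50, "foo"), (40, "bar"), (30, "baz")]
f_index = build_bitmap_index([f for f, _ in records])
g_index = build_bitmap_index([g for _, g in records])

print(f_index)
print(g_index)
```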

Partial-match queries by bitmap indexes

Movie (title, year, length, studioName)

SELECT title FROM Movie WHERE studioName = 'Disney' AND year = 1995;

Take the bitwise AND of the bit-vector for year = 1995 and the bit-vector for studioName = 'Disney'.

Range queries by bitmap indexes

Records: 1:(25,60) 2:(45,60) 3:(50,75) 4:(50,100) 5:(50,120) 6:(70,140) 7:(85,140) 8:(45,350)

Find all records with an age in the range 45-55 and a salary in the range 100-200, using bitmap indexes as follows.

Age:     25: 10000000   45: 01000001   50: 00111000   70: 00000100   85: 00000010
Salary:  60: 11000000   75: 00100000   100: 00010000  120: 00001000  140: 00000110  350: 00000001

Ages in the range: 45: 01000001, 50: 00111000

01000001 OR 00111000 = 01111001

Salaries in the range: 100: 00010000, 120: 00001000, 140: 00000110

00010000 OR 00001000 OR 00000110 = 00011110

01111001 AND 00011110 = 00011000

The answer is records 4 and 5: (50,100) and (50,120).
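The same computation can be written as a short Python sketch, using integers as bit-vectors so that `|` and `&` perform the bitwise OR and AND. The helper names are illustrative, and the vectors are recomputed from the records rather than read from a stored index.

```python
records = [(25, 60), (45, 60), (50, 75), (50, 100),
           (50, 120), (70, 140), (85, 140), (45, 350)]

def bit_vector(field, value):
    """Integer whose i-th bit from the left is 1 iff record i has the value."""
    bits = "".join("1" if r[field] == value else "0" for r in records)
    return int(bits, 2)

def range_vector(field, lo, hi):
    """OR together the vectors of every distinct value in [lo, hi]."""
    result = 0
    for value in {r[field] for r in records if lo <= r[field] <= hi}:
        result |= bit_vector(field, value)
    return result

ages = range_vector(0, 45, 55)          # field 0 is age
salaries = range_vector(1, 100, 200)    # field 1 is salary
answer = format(ages & salaries, "08b")
matches = [r for r, bit in zip(records, answer) if bit == "1"]

print(format(ages, "08b"), format(salaries, "08b"), answer, matches)
```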

Compressed Bitmaps

• Run-length encoding (run: a sequence of i 0's followed by a 1)

• Let j be the number of bits in the binary representation of i (about log2 i). Represent j by j-1 1's and a single 0, followed by i in binary.

• Concatenate the codes for each run together.

Examples: i = 0: 00; i = 1: 01; i = 13: 1110 1101
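The encoding rule can be sketched in Python as follows. Bit-vectors are handled as strings of '0' and '1' for readability, and the function names are illustrative.

```python
def encode_run(i):
    """Code for a run of i 0's followed by a 1."""
    binary = format(i, "b")             # i in binary ("0" when i = 0)
    j = len(binary)                     # number of bits in i
    return "1" * (j - 1) + "0" + binary

def encode(bits):
    """Concatenate the codes of all runs; trailing 0's are not encoded."""
    codes, run = [], 0
    for b in bits:
        if b == "1":
            codes.append(encode_run(run))
            run = 0
        else:
            run += 1
    return "".join(codes)

print(encode_run(0), encode_run(1), encode_run(13))
print(encode("100000001000"))
```

The unary prefix tells the decoder how many bits of i follow, which is what makes the concatenated codes uniquely decodable.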

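Encode and Decode

• Encode

age 25: 100000001000

The runs are (0, 7), with codes 00 and 110111, so the encoded vector is 00110111. The trailing 0's are not encoded.

• Decode

11101101001011

The codes 1110 1101, 00, and 10 11 give the runs 13, 0, 3, so the decoded vector is 0000000000000110001, followed by any trailing 0's.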
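Decoding reverses the scheme: read the unary length, then that many bits of i. A minimal Python sketch, assuming a well-formed code string (the function names are illustrative):

```python
def decode(code):
    """Return the list of run lengths stored in an encoded bit-vector."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":     # unary part: j-1 ones ...
            j += 1
            pos += 1
        pos += 1                    # ... terminated by a single zero
        runs.append(int(code[pos:pos + j], 2))
        pos += j
    return runs

def expand(runs):
    """Rebuild the bit-vector, up to its trailing 0's."""
    return "".join("0" * i + "1" for i in runs)

runs = decode("11101101001011")
print(runs, expand(runs))
```

Because trailing 0's are dropped, the decoder needs the number of records n from elsewhere to restore the full length.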

To perform bitwise AND or OR on encoded bit-vectors

• Decode one run at a time.

• Determine where the next 1 is in each operand bit-vector.

• If OR, produce a 1 at that position of the output; if AND, produce a 1 if and only if both operands have their next 1 at the same position.

Example: OR of the encoded vectors for age 25 (00110111) and age 30 (110111)

• First run: 25 has run 0, so a 1 in position 1; 30 has run 7, so a 1 in position 8.

• Second run: 25 has run 7, so a 1 in position 9.

• Result: 100000011
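The example can be checked with a small Python sketch that works on lists of run lengths. It is a simplification: it collects all positions of 1's first instead of decoding one run at a time, and the function names are illustrative.

```python
def one_positions(runs):
    """Positions (starting at 1) of the 1's described by a list of runs."""
    positions, pos = [], 0
    for zeros in runs:
        pos += zeros + 1
        positions.append(pos)
    return positions

def combine(runs_a, runs_b, op):
    """Bitwise OR or AND of two vectors given as run lists."""
    a, b = set(one_positions(runs_a)), set(one_positions(runs_b))
    ones = a | b if op == "OR" else a & b
    length = max(ones, default=0)
    return "".join("1" if p in ones else "0" for p in range(1, length + 1))

age_25 = [0, 7]     # runs decoded from 00110111
age_30 = [7]        # runs decoded from 110111
print(combine(age_25, age_30, "OR"))
```

A streaming version would keep only the next position from each operand in memory, which matters when the vectors are long.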

Managing Bitmap Indexes

• Finding Bit-Vectors

• Finding Records

• Handling Modifications to the Data File

Finding Bit-Vectors

Use any secondary index with the field value as search key, such as B-tree, hash table or indexed-sequential files.

Finding Records

Use a secondary index on the data file, whose search key is the number of the record.

Handling Modifications to the Data file

• Record numbers must remain fixed once assigned

• Changes to the data file require the bitmap index to change as well

Deleting Record i

• Leave a "tombstone" in the data file.

• In the bit-vector that has a 1 in position i, change that bit from 1 to 0.

Insert New Record

• Assign the next available record number to the new record.

• Modify the bit-vector for the value of the new record by appending a 1 at the end

• Add the new bit-vector for the value which did not appear before.

• Insert the new bit-vector and its corresponding value to the secondary index.

Changing the value of record i from v to w

• Change bit-vector for v in position i from 1 to 0

• Change bit-vector for w in position i from 0 to 1, or create a bit-vector for w if w is a new value.

Conclusion

• Multidimensional Data

• Grid Files

• Partitioned Hash Tables

• Multiple-Key Indexes

• kd-Trees

• Quad Trees

• R-Trees

• Bitmap Indexes

Exercises

• Ex 4.1.2, Ex 4.2.6, Ex 4.3.1, Ex 4.4.6

• Ex 5.1.3, Ex 5.2.7, Ex 5.3.2, Ex 5.4.2
