36
Bitmap Indexing

Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Embed Size (px)

Citation preview

Page 1: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Bitmap Indexing

Page 2: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Bitmap Index Bitmap index: specialized index that takes

advantage Read-mostly data: data produced from scientific

experiments can be appended in large groups Fast operations

“Predicate queries” can be performed with bitwise logical operations• Predicate ops: =, <, >, <=, >=, range,• Logical ops: AND, OR, XOR, NOT

They are well supported by hardware Easy to compress, potentially small index size Each individual bitmap is small and frequently

used ones can be cached in memory

Page 3: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Bitmap Index

Can be useful for stable columns with few values

Bitmap: String of bits: 0 (no match) or 1 (match) One bit for each row

Bitmap index record Column value Bitmap DBMS converts bit position into row identifier.

Page 4: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Bitmap Index Example

RowId FacSSN … FacRank 1 098-55-1234 Asst 2 123-45-6789 Asst 3 456-89-1243 Assc 4 111-09-0245 Prof 5 931-99-2034 Asst 6 998-00-1245 Prof 7 287-44-3341 Assc 8 230-21-9432 Asst 9 321-44-5588 Prof 10 443-22-3356 Assc 11 559-87-3211 Prof 12 220-44-5688 Asst

FacRank Bitmap Asst 110010010001 Assc 001000100100 Prof 000101001010

Faculty Table

Bitmap Index on FacRank

Page 5: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Compressing Bitmaps:Run Length Encoding

Bit vector for Assc: 0010000000010000100

Runs: 2, 8, 4 Can we do: 10 1000 100? Is this

unambiguous? Fixed bits per run (max=4)

0010 1000 0100 Variable bits per run

1010 11101000 110100 11101000 is broken into: 1110 (#bits),

1000 (value)• i.e., 4 bits are required, value is 8

Page 6: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Operation-efficient Compression Methods

Uncompressed:0000000000001111000000000 ......0000001000000001111111100000000 .... 0000100

Compressed:12, 0, 0, 0, 0, 1000, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1000

Store very short sequences as-is

AND/OR/COUNT operations:Can uncompress on the fly

Based on variations of Run Length Compression

Page 7: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Indexing High-Dimensional Data

Typically, high-dimensional datasets are collections of points, not regions. E.g., Feature vectors in multimedia applications. Very sparse

Nearest neighbor queries are common. R-tree becomes worse than sequential scan for most

datasets with more than a dozen dimensions.

As dimensionality increases contrast (ratio of distances between nearest and farthest points) usually decreases; “nearest neighbor” is not meaningful. In any given data set, advisable to empirically test contrast.

Page 8: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

External Sorting

Page 9: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Why Sort? A classic problem in computer science! Data requested in sorted order

e.g., find students in increasing gpa order Sorting is first step in bulk loading B+ tree

index. Sorting useful for eliminating duplicate

copies in a collection of records (Why?) Sort-merge join algorithm involves sorting. Problem: sort 1Gb of data with 1Mb of

RAM. why not virtual memory?

Page 10: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

2-Way Sort: Requires 3 Buffers

Pass 1: Read a page, sort it, write it. only one buffer page is used

Pass 2, 3, …, etc.: three buffer pages used.

Main memory buffers

INPUT 1

INPUT 2

OUTPUT

DiskDisk

Page 11: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Two-Way External Merge Sort

Each pass we read + write each page in file.

N pages in the file => the number of passes

So total cost is:

Idea: Divide and

conquer: sort subfiles and merge

log2 1N

2 12N Nlog

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS 0

PASS 1

PASS 2

PASS 3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,62,6 4,9 7,8 1,3 2

2,34,6

4,7

8,91,35,6 2

2,3

4,46,7

8,9

1,23,56

1,22,3

3,4

4,56,6

7,8

Page 12: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

General External Merge Sort

To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce

sorted runs of B pages each. Pass 2, …, etc.: merge B-1 runs.

N B/

B Main memory buffers

INPUT 1

INPUT B-1

OUTPUT

DiskDisk

INPUT 2

. . . . . .

. . .

More than 3 buffer pages. How can we utilize them?

Page 13: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Cost of External Merge Sort

Number of passes: Cost = 2N * (# of passes) E.g., with 5 buffer pages, to sort 108 page

file: Pass 0: = 22 sorted runs of 5 pages

each (last run is only 3 pages) Pass 1: = 6 sorted runs of 20 pages

each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages

1 1 log /B N B

108 5/

22 4/

Page 14: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Number of Passes of External Sort

N B=3 B=5 B=9 B=17 B=129 B=257100 7 4 3 2 1 11,000 10 5 4 3 2 210,000 13 7 5 4 2 2100,000 17 9 6 5 3 31,000,000 20 10 7 5 3 310,000,000 23 12 8 6 4 3100,000,000 26 14 9 7 4 41,000,000,000 30 15 10 8 5 4

Page 15: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree

index on sorting column(s). Idea: Can retrieve records in order by

traversing leaf pages. Is this a good idea? Cases to consider:

B+ tree is clustered Good idea! B+ tree is not clustered Could be a very

bad idea!

Page 16: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Clustered B+ Tree Used for Sorting

Cost: root to the left-most leaf, then retrieve all leaf pages (data pages)

If leaves contain pointers? Additional cost of retrieving data records: each page fetched just once.

Always better than external sorting!

(Directs search)

Data Records

Index

Data Entries("Sequence set")

Page 17: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Unclustered B+ Tree Used for Sorting

Each data entry contains rid of a data record. In general, one I/O per data record!

(Directs search)

Data Records

Index

Data Entries("Sequence set")

Page 18: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Query Processing

Page 19: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Relational Operations We will consider how to implement:

Selection ( ) Selects a subset of rows from relation. Projection ( ) Deletes unwanted columns from

relation. Join ( ) Allows us to combine two relations. Set-difference ( ) Tuples in reln. 1, but not in reln. 2. Union ( ) Tuples in reln. 1 and in reln. 2. Aggregation (SUM, MIN, etc.) and GROUP BY

Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.

Page 20: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Schema for Examples

Bids: Each tuple is 40 bytes long, 100 tuples per

page, 1000 pages. Buyers:

Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Buyers(id: integer, name: string, rating: integer, age: real)Bids (bid: integer, pid: integer, day: dates, product: string)

Page 21: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

21

Access Path Algorithm + data structure used to

locate rows satisfying some condition File scan: can be used for any condition Hash: equality search; all search key

attributes of hash index are specified in condition

B+ tree: equality or range search; a prefix of the search key attributes are specified in condition

Binary search: Relation sorted on a sequence of attributes and some prefix of sequence is specified in condition

Page 22: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Access Paths A tree index matches (a conjunction of)

terms that involve only attributes in a prefix of the search key. E.g., Tree index on <a, b, c> matches the

selection a=5 AND b=3, and a=5 AND b>6, but not b=3.

A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index. E.g., Hash index on <a, b, c> matches a=5 AND

b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.

Page 23: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

23

Access Paths Supported by B+

tree Example: Given a B+ tree whose search key

is the sequence of attributes a2, a1, a3, a4 Access path for search a1>5 a2=3.0 a3=‘x’ (R):

find first entry having a2=3.0 a1>5 a3=‘x’ and scan leaves from there until entry having a2>3.0 . Select satisfying entries

Access path for search a2=3.0 a3 >‘x’ (R): locate first entry having a2=3.0 and scan leaves until entry having a2>3.0 . Select satisfying entries

No access path for search a1>5 a3 =‘x’ (R)

Page 24: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

24

Choosing an Access Path Selectivity of an access path refers to its

cost Higher selectivity means lower cost (#pages)

If several access paths cover a query, DBMS should choose the one with greatest selectivity

Size of domain of attribute is a measure of the selectivity of domain

Example: CrsCode=‘CS305’ Grade=‘B’ - a B+ tree with search key CrsCode is more selective than a B+ tree with search key Grade

Page 25: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

25

Computing Selection

No index on attr: If rows unsorted, cost = M

• Scan all data pages to find rows satisfying the condition

If rows sorted on attr, cost = log2 M + (cost of scan)• Use binary search to locate first data page containing row

in which (attr = value)• Scan further to get all rows satisfying (attr op value)

condition: (attr op value)

Page 26: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

26

Computing Selection

B+ tree index on attr (for equality or range search): Locate first index entry corresponding to a row

in which (attr = value); cost = depth of tree Clustered index - rows satisfying condition

packed in sequence in successive data pages; scan those pages; cost depends on number of qualifying rows

Unclustered index - index entries with pointers to rows satisfying condition packed in sequence in successive index pages; scan entries and sort pointers to identify table data pages with qualifying rows, each page (with at least one such row) fetched once

condition: (attr op value)

Page 27: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

27

Unclustered B+ Tree Index

B+ Tree

Index entriessatisfyingcondition

Data File

data page

Page 28: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

28

Computing Selection

Hash index on attr (for equality search only): Hash on value; cost 1.2 (to account for

possible overflow chain) to search the (unique) bucket containing all index entries or rows satisfying condition• Unclustered index - sort pointers in index

entries to identify data pages with qualifying rows, each page (containing at least one such row) fetched once

condition: (attr = value)

Page 29: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

29

Complex Selections

Conjunctions: a1 =x a2 <y a3=z (R) Use most selective access path Use multiple access paths

Disjunction: (a1 =x or a2 <y) and (a3=z) (R) DNS (disjunctive normal form) (a1 =x a3 =z) or (a2 < y a3=z) Use file scan if one disjunct requires file scan If better access paths exist, and combined

selectivity is better than file scan, use the better access paths, else use a file scan

Page 30: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Two Approaches to General Selections

First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index: Most selective access path: An index or file scan that we

estimate will require the fewest page I/Os. Terms that match this index reduce the number of tuples

retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.

Consider day<8/9/94 AND pid=5 AND bid=3. A B+ tree index on day can be used; then, pid=5 and bid=3 must be checked for each retrieved tuple. Similarly, a hash index on <pid, bid> could be used; day<8/9/94 must then be checked.

Page 31: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Intersection of Rids Second approach (if we have 2 or more

matching indexes that use pointers to data entries): Get sets of rids of data records using each matching

index. Then intersect these sets of rids Retrieve the records and apply any remaining terms. Consider day<8/9/94 AND pid=5 AND bid=3. If we

have a B+ tree index on day and an index on bid, we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying bid=3 using the second, intersect, retrieve records and check pid=5.

Page 32: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Projection

The expensive part is removing duplicates. SQL systems don’t remove duplicates unless the keyword DISTINCT

is specified in a query. Sorting Approach: Sort on <bid, pid> and remove

duplicates. (Can optimize this by dropping unwanted information while sorting.)

Hashing Approach: Hash on <bid, pid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates.

If there is an index with both B.pid and B.bid in the search key, scan the index!

SELECT DISTINCT B.bid, B.pidFROM Bids B

Page 33: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

33

Projection based on Sorting: Duplicate Elimination

Needed in computing projection and union

Sort-based projection: Sort rows of relation at cost of 2M Log B-1M Eliminate unwanted columns in first scan

(no cost): Log B-1M’ (M’ is projected file size) Eliminate duplicates on completion of last

merge step (no cost) Final Cost: M + 2M’ Log B-1M’

Page 34: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

34

Duplicate Elimination Hash-based projection

Phase 1: input rows, project, hash remaining columns using an B-1 output hash function, and create B-1 buckets on disk, cost = M (original file size) + M’ (projected file size)

Phase 2: sort each bucket to eliminate duplicates, cost (assuming a bucket fits in memory) = 2M’

Total cost = M+3M’

Page 35: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Projection Based on Hashing Partitioning phase: Read R using one input buffer.

For each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buffers. Result is B-1 partitions (of tuples with no unwanted fields).

2 tuples from different partitions guaranteed to be distinct. Duplicate elimination phase: For each partition,

read it and build an in-memory hash table, using hash fn h2 (<> h1) on all fields, while discarding duplicates. If partition does not fit in memory, can apply hash-based

projection algorithm recursively to this partition. Cost: For partitioning, read R, write out each tuple,

but with fewer fields. This is read in next phase.

Page 36: Bitmap Indexing. Bitmap Index uBitmap index: specialized index that takes advantage wRead-mostly data: data produced from scientific experiments can be

Discussion of Projection Sort-based approach is the standard; better

handling of skew and result is sorted. If an index on the relation contains all wanted

attributes in its search key, can do index-only scan. Apply projection techniques to data entries (much

smaller!)

If an ordered (i.e., tree) index contains all wanted attributes as prefix of search key, can do even better: Retrieve data entries in order (index-only scan), discard

unwanted fields, compare adjacent tuples to check for duplicates.