
1

Processing Queries with Bit-Vector Indexes

Originally presented by Anand Deshpande

2

Motivation

Consider the following SQL query

SELECT name, address
FROM students
WHERE Dept = 'CS'
AND Hostel = 'H2'

To process this query
– use a complete scan
– use an index

3

Example students table

[Table: students, 14 rows with columns RID, Name, Hostel, Gender, Dept]

RID: 1 to 14
Name: Abhay Athavale, Bina Bajaj, Chinmay Chatterjee, David DeMillo, Era Edke, Frank Fernandez, Gauri Gaikwad, Hari Hate, Indira Irani, Jaya Joshi, Kader Khan, Leo Lobo, Meera Malik, Naresh Naik
Hostel: H1 (4 rows), H2 (4 rows), H3 (3 rows), H4 (3 rows)
Gender: M (8 rows), F (6 rows)
Dept: CS (6 rows), EE (3 rows), ME (3 rows), AE (2 rows)

4

Using an Index to Process Queries

Find all records (rids) that match
– Dept = 'CS'
– Hostel = 'H2'

Intersect the two sets of rids
Given the rid, get the name and address

In the presence of an index
– FIND is of logarithmic order, O(log N)
– intersecting two sorted rid lists is O(n1 + n2), where n1 and n2 are the list lengths
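The O(n1 + n2) intersection of two sorted rid lists can be sketched as a merge-style walk. This is a minimal illustration, not the presenter's code; the function name `intersect_sorted` is made up here, and the two lists are the rid sets the deck gives for Dept = CS and Hostel = H2.

```python
def intersect_sorted(a, b):
    """Intersect two ascending rid lists in O(len(a) + len(b)) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # rid present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # advance the list with the smaller rid
        else:
            j += 1
    return out


dept_cs = [1, 5, 7, 8, 12, 14]     # rids with Dept = CS (from the deck)
hostel_h2 = [6, 8, 11, 12]         # rids with Hostel = H2 (from the deck)
matching = intersect_sorted(dept_cs, hostel_h2)
print(matching)  # [8, 12]
```

Each step advances at least one of the two cursors, which is where the linear bound comes from.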

5

Processing Queries

[Figure: the students table (RID, Name, Hostel, Gender, Dept), repeated alongside the rid set for each predicate]

Dept = CS: {1, 5, 7, 8, 12, 14}
Hostel = H2: {6, 8, 11, 12}
Dept = CS AND Hostel = H2: {8, 12}

6

What does an Index do?

Index provides a mapping from Value to a Set of Records (RIDs)

Given a value, tell me the records that have that value

Various kinds of indices
– B-Tree
– Hash Index
– R-Tree

7

B-Tree Index

B-Tree Index for Department

Value   List of RIDs
AE      {4, 6}
CS      {1, 5, 7, 8, 12, 14}
EE      {2, 9, 11}
ME      {3, 10, 13}
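A rough sketch of the value-to-rid-list mapping follows. It is not a real B-Tree: a sorted key array with binary search stands in for the tree descent, which keeps the logarithmic lookup. The class name `ValueIndex` and the `dept_by_rid` dictionary are illustrative; the dictionary is derived from the rid lists on this slide.

```python
from bisect import bisect_left


class ValueIndex:
    """Sorted keys plus one rid list per key (a stand-in for a B-Tree)."""

    def __init__(self, column):
        # column maps rid -> value
        groups = {}
        for rid in sorted(column):
            groups.setdefault(column[rid], []).append(rid)
        self.keys = sorted(groups)
        self.rid_lists = [groups[key] for key in self.keys]

    def find(self, value):
        """Binary-search the keys; return the rid list, or [] if absent."""
        pos = bisect_left(self.keys, value)
        if pos < len(self.keys) and self.keys[pos] == value:
            return self.rid_lists[pos]
        return []


dept_by_rid = {1: "CS", 2: "EE", 3: "ME", 4: "AE", 5: "CS", 6: "AE", 7: "CS",
               8: "CS", 9: "EE", 10: "ME", 11: "EE", 12: "CS", 13: "ME", 14: "CS"}
index = ValueIndex(dept_by_rid)
print(index.find("CS"))  # [1, 5, 7, 8, 12, 14]
```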

8

Index Architecture

[Figure: the students table (RID, Hostel, Gender, Dept) flanked by value-based indexes (B-Tree), each pointing to lists of RIDs]

9

Selectivity of Domains

Domain is strongly selective if the number of rids for each value is small
– example: primary key
– only 1 rid for each value

Domain is weakly selective if the number of rids for each value is large
– example: gender (male/female)
– about 0.5 * table_size rids for each value

10

Motivating Bit-Vectors

In queries with constraints on many weakly selective domains, rid intersection costs dominate the cost equations.

AND-ing and OR-ing bit-vectors is an efficient alternative to intersecting and unioning rid sets

11

Bit-Vector Representation of Sets

RID  Score   V7 V6 V5 V4 V3 V2 V1 V0
1    5       0  0  1  0  0  0  0  0
2    3       0  0  0  0  1  0  0  0
3    7       1  0  0  0  0  0  0  0
4    4       0  0  0  1  0  0  0  0
5    3       0  0  0  0  1  0  0  0
6    2       0  0  0  0  0  1  0  0
7    4       0  0  0  1  0  0  0  0
8    6       0  1  0  0  0  0  0  0
9    2       0  0  0  0  0  1  0  0
10   0       0  0  0  0  0  0  0  1
11   5       0  0  1  0  0  0  0  0
12   4       0  0  0  1  0  0  0  0

Set of rids for each score
0 -- {10}
1 -- {}
2 -- {6, 9}
3 -- {2, 5}
4 -- {4, 7, 12}
5 -- {1, 11}
6 -- {8}
7 -- {3}

Score between 4 and 6 -- S4 U S5 U S6, computed as V4 or V5 or V6
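The equality-encoded vectors above can be sketched with Python integers used as bit-sets. This is an illustration under one assumption: bit position `rid` of a vector stands for row `rid`, so bit 0 is unused. The helper names are made up; the scores are the ones on this slide.

```python
scores = {1: 5, 2: 3, 3: 7, 4: 4, 5: 3, 6: 2,
          7: 4, 8: 6, 9: 2, 10: 0, 11: 5, 12: 4}


def build_equality_vectors(column, domain):
    """One bit-vector per value; bit `rid` is set when that row holds the value."""
    vectors = {value: 0 for value in domain}
    for rid, value in column.items():
        vectors[value] |= 1 << rid
    return vectors


def to_rids(vector):
    """List the positions of the set bits in ascending order."""
    return [pos for pos in range(vector.bit_length()) if (vector >> pos) & 1]


V = build_equality_vectors(scores, range(8))
between_4_and_6 = V[4] | V[5] | V[6]   # S4 U S5 U S6 as a bitwise OR
print(to_rids(between_4_and_6))        # [1, 4, 7, 8, 11, 12]
```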

12

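Range Encoded Bit-Vectors

Bit k of a row is set when Score <= k

RID  Score   V7 V6 V5 V4 V3 V2 V1 V0
1    5       1  1  1  0  0  0  0  0
2    3       1  1  1  1  1  0  0  0
3    7       1  0  0  0  0  0  0  0
4    4       1  1  1  1  0  0  0  0
5    3       1  1  1  1  1  0  0  0
6    2       1  1  1  1  1  1  0  0
7    4       1  1  1  1  0  0  0  0
8    6       1  1  0  0  0  0  0  0
9    2       1  1  1  1  1  1  0  0
10   0       1  1  1  1  1  1  1  1
11   5       1  1  1  0  0  0  0  0
12   4       1  1  1  1  0  0  0  0

Score = 4 -- V4 and not V3

Score <= 4 -- V4

Score >= 2 and Score <= 4 -- V4 and not V1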
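A small sketch of range encoding, assuming (as the bit counts on this slide suggest) that vector k marks the rows with Score <= k. Python integers serve as bit-sets with bit `rid` standing for row `rid`. The helper names are illustrative, and `and not` is written as `& ~`.

```python
scores = {1: 5, 2: 3, 3: 7, 4: 4, 5: 3, 6: 2,
          7: 4, 8: 6, 9: 2, 10: 0, 11: 5, 12: 4}


def build_range_vectors(column, domain):
    """Bit `rid` of vectors[k] is set when the row's value is <= k."""
    vectors = {k: 0 for k in domain}
    for rid, value in column.items():
        for k in domain:
            if value <= k:
                vectors[k] |= 1 << rid
    return vectors


def to_rids(vector):
    """List the positions of the set bits in ascending order."""
    return [pos for pos in range(vector.bit_length()) if (vector >> pos) & 1]


R = build_range_vectors(scores, range(8))
score_le_4 = R[4]              # Score <= 4 needs a single vector
score_eq_4 = R[4] & ~R[3]      # Score = 4: at most 4 but not at most 3
score_2_to_4 = R[4] & ~R[1]    # 2 <= Score <= 4: at most 4 but not at most 1
print(to_rids(score_eq_4))     # [4, 7, 12]
```

Any range predicate touches at most two vectors here, whereas the equality encoding needs one OR per value in the range.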

13

Bit-Vectors

[Figure: the students table (RID, Hostel, Gender, Dept) with one bit-vector per distinct value: AE, CS, EE, ME for Dept; H1, H2, H3, H4 for Hostel; M, F for Gender]

14

Merging RecordIds

SELECT name, address
FROM students
WHERE Dept = 'CS'
AND Hostel = 'H2'

What is better?
– Bit-wise AND or Intersection

Depends on how many records are in each set
– if the sets are very small, record-id intersection will be faster than bit-wise AND
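One way to make the trade-off concrete is a back-of-the-envelope cost comparison. The model below is an assumption for illustration only: it charges one step per rid for a sorted-list intersection and one step per machine word for a bit-wise AND, and it ignores I/O, compression and conversion costs. The function name `choose_merge` and the 32-bit word size are hypothetical choices.

```python
import math


def choose_merge(n1, n2, num_rows, word_bits=32):
    """Pick the cheaper strategy under a crude step-count model."""
    rid_cost = n1 + n2                           # merge two sorted rid lists
    and_cost = math.ceil(num_rows / word_bits)   # AND two full-length bit-vectors
    return "rid-intersection" if rid_cost < and_cost else "bitwise-and"


# Two tiny sets in a 1M-row table: walking 8 rids beats AND-ing 31,250 words.
print(choose_merge(3, 5, 1_000_000))
# Two large sets in the same table: the word-wise AND is far cheaper.
print(choose_merge(500_000, 400_000, 1_000_000))
```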

15

Processing Queries

[Figure: three ways to combine Dept = CS and Hostel = H2
– record-set with record-set, giving a record-set
– bit-vector with bit-vector, giving a bit-vector
– bit-vector with record-set, where the record-set is first converted to a bit-vector]

16

Processing Queries

Convert from bit-vector to record-ids and vice versa
For a record-id, probe into the bit-vector
Fast counting of bits to get counts -- extend to sum and average
Skip empty blocks
Deal with NULL values
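The first four operations in this list can be sketched as follows, with a Python integer as the bit-vector and bit `rid` standing for row `rid`. The function names and the 32-bit block size are assumptions made for the example.

```python
WORD = 32  # block size in bits (an assumption for this sketch)


def rids_to_vector(rids):
    """Convert record-ids to a bit-vector."""
    vector = 0
    for rid in rids:
        vector |= 1 << rid
    return vector


def vector_to_rids(vector):
    """Convert a bit-vector to record-ids, skipping all-zero blocks."""
    rids = []
    base = 0
    mask = (1 << WORD) - 1
    while vector:
        word = vector & mask
        if word:  # an empty block is skipped without scanning its bits
            for offset in range(WORD):
                if (word >> offset) & 1:
                    rids.append(base + offset)
        vector >>= WORD
        base += WORD
    return rids


def probe(vector, rid):
    """Test whether a given record-id is in the set."""
    return (vector >> rid) & 1 == 1


def count(vector):
    """Count the set bits, which answers count(*) for the predicate."""
    return bin(vector).count("1")


hostel_h2 = rids_to_vector([6, 8, 11, 12])
print(vector_to_rids(hostel_h2), probe(hostel_h2, 8), count(hostel_h2))
```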

17

N-way AND and ORs

[Figure: an N-way combination of the bit-vectors for Dept = CS, Hostel = H2, Age = 19, Age = 20, Age = 21 and Age = 22; the diagram marks a + operator]

Early Exit Strategy
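A possible reading of the early exit strategy: when AND-ing many vectors, stop as soon as the running result is empty, since no further operand can add bits back. The sketch below assumes the + in the diagram denotes OR over the age vectors. The age rid sets are invented for illustration, and the helper names are hypothetical.

```python
def from_rids(rids):
    """Build a bit-vector with bit `rid` set for each rid."""
    vector = 0
    for rid in rids:
        vector |= 1 << rid
    return vector


def or_all(vectors):
    """N-way OR."""
    result = 0
    for vector in vectors:
        result |= vector
    return result


def and_all(vectors):
    """N-way AND, sparsest vector first; returns (result, vectors consumed)."""
    ordered = sorted(vectors, key=lambda v: bin(v).count("1"))
    result = ordered[0]
    used = 1
    for vector in ordered[1:]:
        if result == 0:
            break  # early exit: the answer is already empty
        result &= vector
        used += 1
    return result, used


dept_cs = from_rids([1, 5, 7, 8, 12, 14])
hostel_h2 = from_rids([6, 8, 11, 12])
# Hypothetical age vectors, not taken from the deck.
ages = [from_rids([1, 8]), from_rids([5, 12]), from_rids([6]), from_rids([11])]

matches, used = and_all([dept_cs, hostel_h2, or_all(ages)])
empty, used_before_exit = and_all([from_rids([1]), from_rids([2]), dept_cs])
```

In the second call the result becomes empty after two vectors, so the third is never touched.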

18

Where are Bit-Vectors good

Equality predicate
– select * from customer where state = 'CA'

AND predicates
– select * from customer where state = 'CA' and gender = 'F'

OR predicates
– select * from customer where state = 'CA' or state is NULL

Queries with Negation
– select * from customer where state <> 'CA' and age between 30 and 40
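The OR and negation cases can be sketched with one bit-vector per state plus one for NULL. The customer rows below are invented for illustration. The sketch follows SQL semantics, where a row with a NULL state does not satisfy state <> 'CA', so the NULL vector is masked out after the complement.

```python
# Hypothetical customer.state column: rid -> state (None models NULL).
states = {1: "CA", 2: "NY", 3: None, 4: "CA", 5: "TX", 6: None}


def build_vectors(column):
    """One bit-vector per distinct value, including None for NULL."""
    vectors = {}
    for rid, value in column.items():
        vectors[value] = vectors.get(value, 0) | (1 << rid)
    return vectors


def to_rids(vector):
    """List the positions of the set bits in ascending order."""
    return [pos for pos in range(vector.bit_length()) if (vector >> pos) & 1]


V = build_vectors(states)
all_rows = sum(1 << rid for rid in states)

ca_or_null = V["CA"] | V[None]               # state = 'CA' or state is NULL
not_ca = all_rows & ~V["CA"] & ~V[None]      # state <> 'CA' (NULL rows excluded)
print(to_rids(ca_or_null), to_rids(not_ca))  # [1, 3, 4, 6] [2, 5]
```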

19

Aggregate Queries

select count(*) from customer
select count(age) from customer where state = 'CA'
select state, count(*) from customer group by state
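These three counts can be answered from bit-vectors alone by counting set bits, without reading the rows. The sketch below uses invented customer data and assumes a separate vector marking rows whose age is NULL, since count(age) skips NULLs. All names are hypothetical.

```python
def from_rids(rids):
    """Build a bit-vector with bit `rid` set for each rid."""
    vector = 0
    for rid in rids:
        vector |= 1 << rid
    return vector


def popcount(vector):
    """Number of set bits."""
    return bin(vector).count("1")


# Hypothetical data for six customers (rids 1 to 6).
all_rows = from_rids(range(1, 7))
state_vectors = {"CA": from_rids([1, 4, 6]),
                 "NY": from_rids([2, 3]),
                 "TX": from_rids([5])}
age_is_null = from_rids([4])

count_star = popcount(all_rows)                               # count(*)
count_age_ca = popcount(state_vectors["CA"] & ~age_is_null)   # count(age) where state = 'CA'
group_counts = {state: popcount(vector)                       # group by state
                for state, vector in state_vectors.items()}
print(count_star, count_age_ca, group_counts)
```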

21

Bit-Vector Indices

The structure that maps value to record-id is the same

The Record List Area stores bit-vectors rather than record lists

22

Comparing Space Requirements

Consider a table with N (1M) records
Consider an index on a domain with n (100) values
The value-based index is identical in both cases

23

Calculating Space

Record-Id
– 1 million record ids
– 32 bits * 1M records
– 32M bits
– 1M words
– N words

Per value
– 100 values, (1,000,000 / 100 = 10,000) rids per value

Bit-Vectors
– 1 million bits per value
– 100 values
– 100 * 1,000,000 bits = 100M bits
– ~3M words
– N * n / 32 words

n -- number of distinct values

N -- number of records
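The arithmetic on this slide can be checked with a few lines. The sketch assumes 32-bit words and uncompressed storage, as the slide does; the function names are illustrative.

```python
WORD_BITS = 32


def rid_list_words(num_records):
    """Each record contributes one 32-bit rid, so the lists total N words."""
    return num_records


def bit_vector_words(num_records, num_values):
    """n vectors of N bits each, packed into 32-bit words: N * n / 32."""
    return num_records * num_values // WORD_BITS


N, n = 1_000_000, 100
print(rid_list_words(N))         # 1000000
print(bit_vector_words(N, n))    # 3125000, the "~3M words" on the slide
print(bit_vector_words(N, 32))   # 1000000, break-even at n = 32
```

Under these assumptions, the two layouts cost the same at n = 32, and bit-vectors are smaller only below that.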

24

Can we do better?

For small domains (< 32 values), bit-vectors are space efficient

For large domains, bit-vectors are sparse
For very large domains, record-ids are the best compression

small domains -- bit-vectors, medium domains -- compress, large domains -- record-ids

25

Handling Skew

In many domains, a large portion of the records hold one of a few distinct values

Even though the number of unique values is large, some domains are still candidates for bit-vectors

Compression of bits to reduce space
Dynamic selection of encoding strategy

26

Compression

Compressing bit-streams of 1s: run-length encoding
– 111100001111 (1:4:9:4)
– works well with large runs
– for very large blocks of zeros, don’t store anything

Must deal with runs as “one” object
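A minimal run-length sketch, assuming the notation (1:4:9:4) means start:length pairs for each run of 1s with 1-based positions, which matches the example bit-stream. Zeros are never stored; they are implied by the gaps. The function names are made up.

```python
def encode_runs(bits):
    """Return (start, length) for each run of 1s; positions are 1-based."""
    runs = []
    start = None
    for pos, bit in enumerate(bits, start=1):
        if bit == "1" and start is None:
            start = pos                         # a run of 1s begins
        elif bit != "1" and start is not None:
            runs.append((start, pos - start))   # the run ended just before pos
            start = None
    if start is not None:                       # run reaching the end of the stream
        runs.append((start, len(bits) - start + 1))
    return runs


def decode_runs(runs, length):
    """Rebuild the bit-stream; positions outside any run are zeros."""
    bits = ["0"] * length
    for start, run_length in runs:
        for pos in range(start, start + run_length):
            bits[pos - 1] = "1"
    return "".join(bits)


runs = encode_runs("111100001111")
print(runs)  # [(1, 4), (9, 4)]
```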

27

Inserts/Deletes/Updates

Delete and insert may require toggling a bit

However, if the number of rows increases, each bitmap needs to be extended

Don’t map bits to rows but to blocks
– shrinks the size of the bit-vector and more bits are set, so better compression is possible
– does not do precise computation, so it can’t deal with NOT, NULL etc.
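The block-mapping idea can be illustrated as follows: one bit per block of rows, set when any row in the block matches. The sketch assumes fixed-size blocks and 1-based rids; the function names and the block size of 4 are hypothetical. The rid set is the Hostel = H2 set from the students example.

```python
def block_vector(rids, rows_per_block):
    """Set one bit per block that contains at least one matching row."""
    vector = 0
    for rid in rids:
        vector |= 1 << ((rid - 1) // rows_per_block)
    return vector


def candidate_rids(vector, rows_per_block, num_rows):
    """Expand set bits back into every rid of the marked blocks."""
    rids = []
    block = 0
    while vector >> block:
        if (vector >> block) & 1:
            first = block * rows_per_block + 1
            last = min(first + rows_per_block - 1, num_rows)
            rids.extend(range(first, last + 1))
        block += 1
    return rids


hostel_h2 = [6, 8, 11, 12]
blocks = block_vector(hostel_h2, rows_per_block=4)   # 14 rows need only 4 bits
candidates = candidate_rids(blocks, rows_per_block=4, num_rows=14)
print(bin(blocks), candidates)  # 0b110 [5, 6, 7, 8, 9, 10, 11, 12]
```

The candidates are a superset of the true answer, so each one still has to be checked against the row. Complementing a block vector would wrongly drop blocks that hold both matching and non-matching rows, which is one way to see why NOT cannot be computed precisely.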
