36
DBMS 2001 Notes 4.2: Hashing 1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom)

DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

Embed Size (px)

Citation preview

Page 1: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 1

Principles of Database Management Systems

4.2: Hashing Techniques

Pekka Kilpeläinen(after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and

Jennifer Widom)

Page 2: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 2

Hashing?

• Locating the storage block of a record by the hash value h(k) of its key k

• Normally really fast– records (often) located by a single

disk access

Page 3: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 3

key h(key)

Hashing

<key>

.

.

Buckets(typically 1disk block)

Page 4: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 4

.

.

.

Two alternatives

records

.

.

.

key h(key)

(1) Hash value determines the storage block directly

• to implement a primary index

Page 5: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 5

key h(key)

Index

recordkey 1

Two alternatives

• for a secondary index

(2) Records located indirectly via index buckets

Page 6: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 6

Example hash function

• Key = ‘x1 x2 … xn’ n byte character string

• Have b buckets• h = (x1 + x2 + … + xn) mod b

{0, 1, …, b-1}

Page 7: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 7

This may not be best function …

Good hash Expected number of function: keys/bucket is the

same for all buckets

Read Knuth Vol. 3 if you reallyneed to select a good function.

Page 8: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 8

Next: example to illustrateinserts, overflows,

deletes

h(K)

Page 9: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 9

EXAMPLE 2 records/bucket

INSERT:h(a) = 1h(b) = 2h(c) = 1h(d) = 0

0

1

2

3

d

ac

b

h(e) = 1

e

Page 10: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 10

0

1

2

3

a

bc

e

d

EXAMPLE: deletion

Delete:ef

fg

maybe move“g” up

cd

Page 11: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 11

Rule of thumb:• Try to keep space utilization

between 50% and 80% Utilization = # keys used

total # keys that fit

• If < 50%, wasting space• If > 80%, overflows significant

depends on how good hashfunction is & on # keys/bucket

Page 12: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 12

How do we cope with growth?

• Overflows and reorganizations• Dynamic hashing: # of buckets

may vary• Extensible• Linear• also others ...

Page 13: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 13

Extensible hashing: two ideas

(a) Use i of b bits output by hash function

b h(K)

use i grows over time….

00110101

For example, b=32

Page 14: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 14

(b) Use directory

h(K)[i ] to bucket

.

.

.

.

Directory contains 2i pointers to buckets, and stores i.

Each bucket stores j, indicating #bits used for placing the records in this block (j i)

Page 15: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 15

Extensible Hashing: Insertion

• If there's room in bucket h(k)[i], place record there; Otherwise …

• If j=i, set i=i+1 and double the directory • If j<i, split the block in two, distribute

records among them now using j+1 bits of h(k); (Repeat until some records end up in the new bucket); Update pointers of bucket array

• See the next example

Page 16: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 16

Example: h(k) is 4 bits; 2 keys/block

i = 1

1

1

0001

1001

1100

Insert 1010

11100

1010

New directory

200

01

10

11

i =

2

2

(j)

Page 17: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 17

10001

21001

1010

21100

Insert:

0111

0000

00

01

10

11

2i =

Example continued

0111

0000

0111

0001

2

2

Page 18: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 18

00

01

10

11

2i =

21001

1010

21100

20111

20000

0001

Insert:

1001

Example continued

1001

1001

1010

000

001

010

011

100

101

110

111

3i =

3

3

Page 19: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 19

Extensible hashing: deletion

• Reverse insert procedure …

• Example: Walk thru insert example in reverse!

Page 20: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 20

Extensible hashing

Can handle growing files- without full reorganizations

Summary

+

Indirection(Not bad if directory in memory)

Directory doubles in size(First it fits in memory, then it does

not sudden performance degradation)

-

-

Only one data block examined+

Page 21: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 21

Linear hashing: grow # of buckets by one

No bucket directory needed

Two ideas:(a) Use i low order bits of hash

01110101grows

b

i(b) File grows linearly

Page 22: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 22

Linear Hashing: Parameters

• n: number of buckets in use– buckets numbered 0…n-1

• i: number of bits of h(k) used to address buckets

• r: number of records in hash table– ratio r/n limited to fit an avg bucket in a block– next example: r 1.7n, and block holds 2 records

=> AVG bucket occupancy is 1.7/2 = 0.85 of a block

)log(ni

Page 23: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 23

Example: 2 keys/block, b=4 bits, n=2, i =1

00 01

1111

0000

1010

If h(k)[i ] = (a1 … ai)2 < n, then

look at bucket h(k)[i ]; else

look at bucket h(k)[i ] - 2i -1 = (0a2 … ai)2

Rule

• insert 0101

0101

• now r=4 >1.7n get new bucket

10and distribute keys btw

buckets 00 and 10

Page 24: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 24

n=3, i =2; distribute keys btw buckets 00 and 10:

00 01 10

0101

1111

0000

1010

1010

Page 25: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 25

n=3, i =2; insert 0001:

00 01 10

0101

1111

0000

0001

1010

• can have overflow chains!

Page 26: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 26

n=3, i =2

00 01 10

0101

1111

0000 1010

0001 • insert 0111

• bucket 11 not in use redirect to 01

0111

• now r=6 > 1.7n-> get new bucket 11

Page 27: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 27

n=4, i =2; distribute keys btw 01 and 11

00 01 10 11

0101

1111

0000 1010

00010111

1111

01110001

Page 28: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 28

Example Continued: How to grow beyond this?

00 01 10 11

111110100101

0101

0000

m = 11 (max used block)

i = 2

0 0 0 0 101 110 111

3

. . .

100

100

101

101

0101

0101

Page 29: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 29

Linear Hashing

Can handle growing files- without full reorganizations

No indirection directory of extensible hashingCan have overflow chains

- but probability of long chains can be kept low by controlling the r/n fill ratio (?)

Summary

+

+

-

Page 30: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 30

Hashing- How it works- Dynamic hashing

- Extensible- Linear

Summary

Page 31: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 31

Next:

• Indexing vs Hashing• Index definition in SQL

Page 32: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 32

• Hashing good for probes given keye.g., SELECT …

FROM RWHERE R.A = 5

Indexing vs Hashing

Page 33: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 33

• INDEXING (Including B-Trees) good for

Range Searches:e.g., SELECT

FROM RWHERE R.A > 5

Indexing vs Hashing

Page 34: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 34

Index definition in SQL

• Create index name on rel (attr)• Create unique index name on rel

(attr)defines candidate key

• Drop INDEX name

Page 35: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 35

CANNOT SPECIFY TYPE OF INDEX(e.g. B-tree, Hashing, …)

OR PARAMETERS(e.g. Load Factor, Size of Hash,...)

... at least in SQL …Oracle and IBM DB2 UDB provide a

PCTFREE clause to inditate the proportion of B-tree blocks initially left unfilled

Oracle: “Hash clusters” with built-in or DBA-specified hash function

Note

Page 36: DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals

DBMS 2001 Notes 4.2: Hashing 36

The BIG picture….

• Chapters 2 & 3: Storage, records, blocks...

• Chapter 4: Access Mechanisms- Indexes

- B trees- Hashing

• Chapters 6 & 7: Query ProcessingNEXT