26
1 CS143: Hash Index

Hash

Embed Size (px)

DESCRIPTION

hash introduction

Citation preview

1

CS143: Hash Index

2

What is a Hash Table?

• Hash Table– Hash function

• h(k): key ‡ integer [0…n]• e.g., h(‘Susan’) = 7

– Array for keys: T[0…n]– Given a key k, store it in

T[h(k)]

5

Susan4

James3

2

Neil1

0

h(Susan) = 4h(James) = 3h(Neil) = 1

3

Why Hash Table?

• Direct access• Saved space

– Do not reserve a space forevery possible key

• How can we use this idea fora record search in a DBMS?

4

Hashing for DBMS(Static Hashing)

(key, record)

.

..

Disk blocks (buckets)

search key Æ h(key)

0

1

2

3

4

5

Issues?

• What hash function?• Overflow?• How to store records?

6

Record storage

1. Whole record

2. Key and pointer

.

..

record

.

.

.

recordkey 1

h(key)

h(key)

7

Overflow and Chaining

• Inserth(a) = 1h(b) = 2h(c) = 1h(d) = 0h(e) = 1

• Deleteh(b) = 2h(c) = 1

0

1

2

3

Do we keep keys sorted inside a block?

8

Questions

• Q: Do we keep keys sortedinside a block?

• Q: How much space shouldwe use for “good”performance?

9

What Hash Function?

• Example– Key = ‘x1 x2 … xn’ n byte

character string– Have b buckets– h: (x1 + x2 + ….. xn) mod b

• Desired property– Uniformity: same # of keys for

every bucket• Reference

– “The Art of ComputerProgramming Vol. 3” by Knuth

10

Major Problem ofStatic Hashing

• How to cope with growth?– Data tends to grow in size– Overflow blocks unavoidable

102030

405060

708090

39313536

323834

33

overflow blockshash buckets

11

Idea 1

• Periodic reorganization

– New hash function– Complete reassignment of records– Very expensive

102030

4050…

33

102030

405060

708090

33

405060

12

Idea 2

• Why don’t we split a bucketwhen it overflows?

0 a

3 b

2 d

1 c

e

Insert: h(g) = 1

13

(a) Use i of b bits output byhash function

Extendible Hashing(two ideas)

00110101h(K) Æ

b

use i Æ grows over time

14

(b) Use directory thatmaintains pointers to hashbuckets (indirection)

Extendible Hashing(two ideas)

..

.

.

..

c

e

hash bucketdirectory

h(c)

15

Example• h(k) is 4 bits; 2 keys/bucket

Insert 0111, 1010

1

1

1

0001

1001

1100

0

1

i =

16

Example continued

Insert 0000

1

0001

2

1001

1010

2

1100

00

01

10

11

2i =

0111

17

Example continued

Insert 1011

00

01

10

11

2i =

21001

1010

21100

20111

20000

0001

18

Questions onExtendible Hashing

• Q: What will happen if wehave a lot of duplicate searchkeys?

19

Duplicate keys

• Insert 0001

1

1

1

0001

0

1

i =0001

20

Questions onExtendible Hashing

• Can we provide minimumspace guarantee?

21

Space Waste

2

1

40001

40000

0000

4i =

3

22

Extendible Hashing:Deletion

• Two optionsa) No merging of bucketsb) Merge buckets and shrink

directory if possible

23

Bucket Merge Example

Can we merge a and b? b and c?

Can we shrink the directory?

1

0001

2

1001

2

1100

00

01

10

11

2i =a

b

c

24

Bucket MergeCondition

• Bucket merge condition– Bucket i’s are the same– First (i-1) bits of the hash key

are the same

• Directory shrink condition– All bucket i’s are smaller than

the directory i

25

Summary ofExtendible Hashing

• Can handle growing files– No periodic reorganizations

• Indirection– Up to 2 disk accesses to

access a key

• Directory doubles in size– Not too bad if the data is not

too large

26

Hashing vs. Tree

• Can an extendible-hash indexsupport?

SELECTFROM RWHERE R.A > 5

• Which one is better, B+treeor Extendible hashing?

SELECTFROM RWHERE R.A = 5