© Neeraj Suri EU-NSF ICT March 2006 Dependable Embedded Systems & SW Group Introduction to Computer Science 2 Hash


Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de

Introduction to Computer Science 2

Hash Tables (2)

Prof. Neeraj Suri, Dan Dobre

ICS-II - 2008: Hash Tables (2)

Overview

So far:
• Direct hashing
• Hash functions (folding, modulo etc.)
• Collision resolution (linear & quadratic probing)

What’s next?
• Collision resolution continued
• Cost analysis of hashing
• Hashing on external memory
• Extendible (dynamic) hashing
• Excursus: (pseudo-)random numbers and their application


Double/repeated Hashing

If a collision occurs, the key is hashed a second time using another hash function.

Can be generalized: if a collision occurs again, the key is hashed with the next hash function.

If the collision persists after using k hash functions, another technique has to be applied.

Avoids the accumulation of collisions, but delete remains complex and reaching the entire memory space is problematic.
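The idea of repeated hashing can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the table size M = 7 and the three hash functions are hypothetical choices made only for this example.

```python
M = 7  # table size (assumed for this example)

# Hypothetical sequence of k = 3 hash functions, tried in this order.
HASH_FUNCTIONS = [
    lambda key: key % M,
    lambda key: (key // M) % M,
    lambda key: key % 5,
]

def insert(table, key):
    """Store key at the first free address produced by the hash functions."""
    for h in HASH_FUNCTIONS:
        address = h(key)
        if table[address] is None or table[address] == key:
            table[address] = key
            return address
    # The collision persists after k hash functions.
    raise RuntimeError(f"no free address for key {key}: another technique is needed")

def search(table, key):
    """Probe the same addresses in the same order as insert.

    Stopping at an empty address is only valid as long as nothing was deleted.
    """
    for h in HASH_FUNCTIONS:
        address = h(key)
        if table[address] == key:
            return address
        if table[address] is None:
            return None
    return None
```

With these assumed functions the keys 11, 32 and 25 (all with home address 4) end up in addresses 4, 2 and 3, while a fourth synonym such as 18 exhausts all three functions.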


Chaining of synonyms in the same HT

Members of a collision class are chained.
Each memory slot in the HT must have an additional pointer.
Because there is no separate overflow area, collisions continue to occur due to foreign occupation.
Chaining does not prevent collisions, but it facilitates the search.
Delete becomes considerably easier, because only one pointer has to be reset.
Insert requires following the pointer list until a free place is found.
If the home address is occupied by another key (which does not belong there), that key is moved.


Chaining: Example

h_0(K) = K mod 7; h_i(K) = (h_0(K) + i) mod m, with m = 7

Insert: 11, 32, 8, 25

Address:  0   1   2   3   4   5   6
Key:      -   8   -   -   11  32  25
Pointer:  -   -   -   -   5   6   -


Chaining: Example


Now insert 12

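The home address of 12 is 5, but 5 is occupied by 32, which does not belong there (its home address is 4).

Move 32: search left for the pointer that refers to it, then move 32 further to the next free position, 0.

State before the move:

Address:  0   1   2   3   4   5   6
Key:      -   8   -   -   11  32  25
Pointer:  -   -   -   -   5   6   -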


Chaining: Example


Now insert 12 in its home address

State after moving 32 (address 5 is now free for 12):

Address:  0   1   2   3   4   5   6
Key:      32  8   -   -   11  -   25
Pointer:  6   -   -   -   0   -   -
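The insert rules of this example can be sketched in code. This is a minimal sketch under stated assumptions: the function names are hypothetical, free places are found by linear probing as in h_i(K), and a moved key keeps its own pointer while its predecessor is redirected to the new position.

```python
M = 7  # table size from the example

def new_table():
    """Return two parallel lists: the keys and the chain pointers."""
    return [None] * M, [None] * M

def find_free(keys, start):
    """Linear probing: first free address after start (wrapping around)."""
    for i in range(1, M + 1):
        address = (start + i) % M
        if keys[address] is None:
            return address
    raise RuntimeError("hash table is full")

def insert(keys, pointers, key):
    home = key % M
    if keys[home] is None:
        keys[home] = key
        return
    occupant = keys[home]
    if occupant % M != home:
        # Foreign occupant: find its predecessor in its own chain, which
        # starts at the occupant's home address, and move the occupant.
        predecessor = occupant % M
        while pointers[predecessor] != home:
            predecessor = pointers[predecessor]
        free = find_free(keys, home)
        keys[free], pointers[free] = occupant, pointers[home]
        pointers[predecessor] = free
        keys[home], pointers[home] = key, None
        return
    # Synonym: follow the pointer list to its end, then probe for a free place.
    end = home
    while pointers[end] is not None:
        end = pointers[end]
    free = find_free(keys, end)
    keys[free] = key
    pointers[end] = free
```

Inserting 11, 32, 8, 25 and then 12 with this sketch yields the chain 4-0-6 used in the delete example.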


Chaining: Example

h_0(K) = K mod 7; h_i(K) = (h_0(K) + i) mod m, with m = 7

Delete 11:
• Follow the chain until 25 is reached (4-0-6)
• Move 25 to its home address 4
• Delete pointer “6” in address 0

State before the delete:

Address:  0   1   2   3   4   5   6
Key:      32  8   -   -   11  12  25
Pointer:  6   -   -   -   0   -   -


Chaining: Example


The collision chain up to 32 is now broken (address 6 is empty).
This is not a problem, since pointers are used for chaining.

Address:  0   1   2   3   4   5   6
Key:      32  8   -   -   25  12  -
Pointer:  -   -   -   -   0   -   -


Chaining with separate overflow

All records that cannot be stored in their own home address are transferred to an overflow area.

The overflow area can be:

A single overflow for all synonyms with only one entry point
• simple, avoids having pointers in the hash table
• possibly long synonym chains, therefore only suitable with a small collision frequency

A single overflow with more than one entry point
• efficient, since only members of a collision class are browsed
• requires a pointer for each entry in the hash table
• the reference to the synonym chain can be implemented using double hashing; in the case of collisions, the synonyms (mostly few) of 2 collision classes are affected


Chaining with separate overflow

The separate overflow area can be assigned dynamically.
The HT can be restricted to the keys in the home address; all data can be stored in the dynamic overflow area.
Since pointers can refer to any address, this corresponds to a partition of the overflow.
Chaining of synonyms is a preferred method.

Position  Key        Pointer (overflow chain)
0         HAYDN      -> HAENDEL -> VIVALDI
1         BEETHOVEN  -> BACH -> BRAHMS
2         CORELLI
3
4         SCHUBERT   -> LISZT
5         MOZART
6


Hashing: analysis of the costs

Cost measure: number of steps (addressing attempts)

Assumptions:
• The same time effort for all h(Kp) and search steps
• The hash table is allocated with n keys

Search costs Sn = delete costs without rearrangement

Insert costs = costs of an unsuccessful search Un

Delete costs = Sn + rearrangement Rn

Costs can be expressed as a function of the allocation factor β = n/m


Hashing: analysis of the costs – extreme cases

Worst case: Sn = n, Un = n + 1 (one collision class, access as in a linear list)

Best case: Sn = 1, Un = 1 (no collisions)


Hashing: analysis of the costs – average cases

Average case depends on overflow handling

Assumption: h(Kp) distributes keys uniformly

-> The probability that a key has a given hash value i, 0 ≤ i ≤ m-1, is 1/m


Costs using linear probing

Example: h_i(k) = (h_0(k) + i) mod m
• With a low allocation of the HT: no problem
• With a higher allocation: drastic degradation

The probability p that 7 will be allocated is 1/m, because 6 is free.
The probability that 14 will be allocated is 5/m (the p for 14 as home address plus the sum of the p for 10, 11, 12, 13, which can produce an overflow onto 14).

Long chains grow longer, and chains can grow together (insert in 3 or 14).

[Figure: hash table with addresses 0 to 16 and their occupancy]


Costs using linear probing

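According to Knuth, with 0 ≤ β = n/m < 1:

$S_n \approx 0.5\,\left(1 + \frac{1}{1-\beta}\right)$

$U_n \approx 0.5\,\left(1 + \frac{1}{(1-\beta)^2}\right)$

The number of search steps increases drastically with a higher allocation factor.

[Figure: number of steps (1 to 8) for Sn and Un, plotted over β from 0.1 to 0.9]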
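Knuth's estimate for the successful search can be compared with a small experiment. The following is only a sketch under assumptions: keys are drawn pseudo-randomly with a fixed seed, nothing is deleted (so finding a key later costs as many probes as inserting it did), and all function names are hypothetical.

```python
import random

def average_successful_search(m, n, seed=0):
    """Fill a table of size m with n keys using linear probing and return
    the average number of probes needed to find a stored key."""
    rng = random.Random(seed)
    table = [None] * m
    total_probes = 0
    for key in rng.sample(range(10 * m), n):  # n distinct keys
        position = key % m
        probes = 1
        while table[position] is not None:
            position = (position + 1) % m
            probes += 1
        table[position] = key
        total_probes += probes
    return total_probes / n

def knuth_successful(beta):
    """Knuth's approximation for Sn with linear probing."""
    return 0.5 * (1 + 1 / (1 - beta))

if __name__ == "__main__":
    m = 10007
    for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
        measured = average_successful_search(m, int(beta * m))
        print(f"beta={beta}: measured {measured:.2f}, formula {knuth_successful(beta):.2f}")
```

The measured values depend on the seed and on the table size, so they should only be expected to lie near the formula, not to match it exactly.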


Costs using optimal collision resolution

With optimal methods for collision resolution, a uniform distribution can be approximately assumed despite collisions.
E.g.: rehashing, pseudo-random numbers etc.

The probability that a place is occupied or free depends on the number of already allocated places (n) and on the number that are still available (m-n).
E.g.: Pfree = (m-n)/m

See the script for details of the derivation.


Costs using optimal collision resolution (2)

Approximately, with 0 ≤ β = n/m < 1:

$S_n \approx \left|\frac{1}{\beta}\,\ln(1-\beta)\right|$

$U_n \approx \frac{1}{1-\beta}$

The number of search steps can improve drastically with independent allocation after collision resolution.

[Figure: number of steps (1 to 8) for Sn and Un, plotted over β from 0.1 to 0.9]
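To see how large the difference is, the two pairs of approximation formulas (linear probing according to Knuth, and optimal collision resolution) can simply be evaluated side by side. The function names below are hypothetical; the formulas are the ones given on the slides.

```python
import math

def linear_probing(beta):
    """Return (Sn, Un) for linear probing according to Knuth."""
    s = 0.5 * (1 + 1 / (1 - beta))
    u = 0.5 * (1 + 1 / (1 - beta) ** 2)
    return s, u

def optimal_resolution(beta):
    """Return (Sn, Un) for optimal (uniform) collision resolution."""
    s = -math.log(1 - beta) / beta
    u = 1 / (1 - beta)
    return s, u

if __name__ == "__main__":
    print("beta  linear Sn  linear Un  optimal Sn  optimal Un")
    for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
        ls, lu = linear_probing(beta)
        os_, ou = optimal_resolution(beta)
        print(f"{beta:<5} {ls:9.2f}  {lu:9.2f}  {os_:10.2f}  {ou:10.2f}")
```

At β = 0.9, for instance, the unsuccessful search needs about 50.5 steps with linear probing but only about 10 with optimal collision resolution.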


Costs using separate overflow

Assumption: uniform distribution of the keys over all chains; n/m = keys per chain; furthermore linear chaining (Q: how big is Sn?)

If key i is inserted into the HT, then i-1 keys are in the table and (i-1)/m keys are in each chain.

The costs to find a free place are 1 step for the home address plus (i-1)/m steps to reach the end of the chain (one must first check whether the key already exists in the table).

Averaged over all n keys:

$S_n = \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right) = 1 + \frac{n-1}{2m} \approx 1 + \frac{\beta}{2}$


Costs using separate overflow

For a successful search, half of the chain is traversed on average.

For an unsuccessful search, the entire chain has to be traversed.

Chaining is superior to other methods: good efficiency even with high overflow (β > 1).

β    0.5   0.75  1     1.5   2     3     4     5
Sn   1.25  1.37  1.5   1.75  2     2.5   3     3.5
Un   1.11  1.22  1.37  1.72  2.14  3.05  4.02  5.01
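The table can be recomputed with a few lines of code. The Sn row follows from the formula Sn ≈ 1 + β/2 derived on the previous slide. No formula for Un is given on the slides; the sketch below assumes Un ≈ β + e^(-β), which is an assumption of this example that happens to reproduce the tabulated values to two decimals.

```python
import math

BETAS = (0.5, 0.75, 1, 1.5, 2, 3, 4, 5)

def successful(beta):
    """Sn for separate chaining, as derived on the slide."""
    return 1 + beta / 2

def unsuccessful(beta):
    """Un for separate chaining (assumed formula, not stated on the slide)."""
    return beta + math.exp(-beta)

if __name__ == "__main__":
    for beta in BETAS:
        print(f"beta={beta}: Sn={successful(beta):.2f}, Un={unsuccessful(beta):.2f}")
```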


Hashing on external memory (b>1)

With a bucket factor b > 1, b records can be stored in one address.

Suitable for both main and external memory, particularly attractive with external memory.

On a collision, the new record is simply stored in the same bucket.

A bucket overflows only with the (b+1)-th entry.

On overflow, the known methods for collision resolution can be applied:
• Overflow in the primary area
• Separate overflow area


Hashing on external memory

Overflow buckets can be assigned dynamically and linked via an overflow address.

An overflow bucket can serve as overflow area for several home addresses.

Recommended: one chain per collision class

With b > 1, the allocation factor is β = n/(b·m)

Sequence for storing records in a bucket:
• According to the insert sequence (sequential)
• According to the sorting sequence (linked list)


Hashing on external memory

Typical bucket size:
• Sector
• Track
• Page

Generally: transfer unit (1 I/O per bucket)

Like B-Trees:
• I/O dominates (approx. 6-10 ms)
• A more complex hash function is justified
• Relative search costs inside one bucket are low

Insert always at the first free space in the chain.

On deletion, there is no need to bridge gaps (or only inside a page).

Empty overflow buckets are removed from the chain.


Example: b=2

b=2; h(k) = k mod 7

Insert: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27

0 1 2 3 4 5 6


Example: b=2 (2)

Now: delete 25

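State after all inserts (each bucket holds b = 2 records):

Address  Primary bucket  Overflow buckets
0        21, -
1        8, 15
2        2, -
3        -, -
4        11, 32          -> 25, 18 -> 4, -
5        -, -
6        13, 20          -> 27, -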


Example: b=2 (3)

Chains are not closed up! Records are rearranged only inside a page, if needed.

Address  Primary bucket  Overflow buckets
0        21, -
1        8, 15
2        2, -
3        -, -
4        11, 32          -> 18, - -> 4, -
5        -, -
6        13, 20          -> 27, -
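The b = 2 example can be replayed with a short sketch. It assumes one overflow chain per home address, insertion at the first free space in the chain, and removal of overflow buckets that become empty; the function names are hypothetical.

```python
B = 2  # bucket factor
M = 7  # number of primary buckets, h(k) = k mod 7

def new_file():
    """One chain of pages per address; the first page is the primary bucket."""
    return [[[]] for _ in range(M)]

def insert(chains, key):
    chain = chains[key % M]
    for page in chain:
        if len(page) < B:          # first free space in the chain
            page.append(key)
            return
    chain.append([key])            # allocate a new overflow bucket

def delete(chains, key):
    chain = chains[key % M]
    for index, page in enumerate(chain):
        if key in page:
            page.remove(key)       # the gap is closed inside this page only
            if not page and index > 0:
                del chain[index]   # empty overflow buckets leave the chain
            return True
    return False

if __name__ == "__main__":
    chains = new_file()
    for key in (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27):
        insert(chains, key)
    delete(chains, 25)
    for address, chain in enumerate(chains):
        print(address, chain)
```

After deleting 25, the chain of address 4 still consists of three pages: records from later pages are not pulled forward, which is exactly the behavior shown on the slide.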


Summary: Hashing on external memory

Primary buckets always remain assigned because of relative addressing.

Overflow buckets are assigned dynamically (appended); empty buckets are deleted.

With strong shrinkage, buckets may become underfilled (reorganization of the file, e.g. by rehashing all entries stored in the hash table).


Approximate values for Hashing

Selected values for Sn(b) and Un(b) as function of b and β

Rule of thumb: b is typically determined by the data transfer unit; select β in such a way that S ~ 1.05 to 1.08 holds.


Hashing vs. B+-Tree

Access costs with a well-designed hash method are better than with a B+-Tree (1.05 vs. path length).

Disadvantages:

No sorting of all keys (sequential output has a noticeably higher cost)

Hashing is static
• Not extendable, long chains lead to degeneration
• Consumes the complete designated memory space even with a small number of keys
• (can also be an advantage: the required memory space is defined to a large extent from the beginning)


Extendible Hashing

Disadvantages of static hash methods with a strongly growing volume of data:
• The primary area must be dimensioned generously from the beginning (bad initial allocation)
• If the capacity of the primary area is exceeded, the overflow chains grow fast
• Run time behavior degrades
• Reorganization requires unloading the entire volume of data and loading it again: interruption of the operation (often not possible, e.g., with 24x7 operation)


Extendible Hashing

Therefore we need a hash method that
• Permits dynamic growing and shrinking of the hash area
• Guarantees constant run time behavior independently of the size of data
• Requires not more than 2 page accesses for finding a record
• Avoids overflow mechanisms and total reorganization
• Guarantees a high allocation of the memory independently of the growth of the key set


Extendible Hashing

Must avoid overflow buckets

We would like stability and are ready to pay for it, i.e., constantly 2 accesses.

Available (known to us) techniques:
• Balancing as in B-Trees (constant path length)
• Addressing techniques via coding of the key, as in digital trees

Extendible hashing uses these techniques in order to guarantee a stable access with exactly 2 I/O operations.


Extendible Hashing

The hash function transforms keys into binary strings (coding).

Only the first n bits are used, as many as necessary (addressing as in the digital tree).

Additional indirection via the container board (directory):
• With few keys, few bits are sufficient
• With many keys, additional bits are used

Containers are added or removed if necessary (balancing).

The container board is “doubled” if necessary: this costs memory space, but no intensive computation.


Example: Extendible Hashing

Insertion sequence: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27

Key  Binary     Key  Binary
11   001011     2    000010
32   100000     18   010010
8    001000     13   001101
25   011001     20   010100
21   010101     4    000100
15   001111     27   011011


Extendible Hashing, b=2

Initial situation: the container board contains only a reference to an empty container.

Insert 11 (001011) and 32 (100000): works without problems.



Next key: 8 (001000)

Doesn’t fit anymore.

Thus, doubling of the capacity through duplication of the container board (still no extra containers!)



Blue numbers: implicit through addresses of the container board

Now: next key 8 (001000)

Fits after the container has been partitioned.



Next key: 25 (011001)

Doesn’t fit in the first container, and no other address is available (for partitioning the container): the container board has to be doubled.



Again: through doubling of the container board, no extra container is generated

Next key (still): 25 (011001)



Additional container



Next key: 21 (010101)

No problems



Next key: 15 (001111)

Simple doubling of the container board



Next key: 15 (001111)

Still not possible: doubling again



Next key: 15 (001111)

Now the selectivity is sufficient: the container is split into two.



Next keys (straightforward): 2 (000010), 18 (010010), 13 (001101), 20 (010100), 4 (000100), 27 (011011)



Finish


Extendible Hashing

The prefix of the key does not always have to be used; the suffix can be used as well.

For keys that are not uniformly distributed, an internal hash function can be used to produce the bit string used in extendible hashing.


Summary, extendible Hashing

Key fragment with n bits: direct hashing into the container board

Containers have a bucket factor b > 1 (typically b > 20)

Search
• Look up the container address in the container board
• Search in the container (e.g., binary search)

Insert
• Look up the container address in the container board
• Search in the container
• If found: good, no further actions
• If not found
  - If there is a free slot in the container: insert
  - If there is no free slot:
    double the container board until the key fragment is selective enough to establish more containers (note: sometimes the container board doesn’t need to be doubled);
    add new containers and, if needed, redistribute keys from the old container among the new containers

Delete
• Look up the container address in the container board
• Search in the container
• If found: delete
• If the container is empty: delete the container and set the pointer in the container board to the neighbor container
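The search and insert steps summarized above can be sketched in code. This is a minimal sketch, not the lecture's implementation: it assumes 6-bit keys addressed by their leading bits (as in the example with b = 2), keeps a local depth per container to decide when the container board must be doubled, and omits delete. The class and attribute names are hypothetical.

```python
class Container:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # number of key bits this container depends on
        self.keys = []

class ExtendibleHash:
    def __init__(self, key_bits=6, capacity=2):
        self.key_bits = key_bits
        self.capacity = capacity        # bucket factor b
        self.global_depth = 0           # number of bits used by the container board
        self.directory = [Container(0)] # the container board

    def _index(self, key):
        """Address in the container board: the first global_depth bits of the key."""
        return key >> (self.key_bits - self.global_depth)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            container = self.directory[self._index(key)]
            if key in container.keys:
                return
            if len(container.keys) < self.capacity:
                container.keys.append(key)
                return
            if container.local_depth == self.global_depth:
                if self.global_depth == self.key_bits:
                    raise OverflowError("key bits exhausted")
                # Double the board: every pointer is copied to two successive addresses.
                self.directory = [c for c in self.directory for _ in (0, 1)]
                self.global_depth += 1
            self._split(container)

    def _split(self, container):
        depth = container.local_depth + 1
        low, high = Container(depth), Container(depth)
        shift = self.key_bits - depth
        for k in container.keys:        # redistribute by the next key bit
            (high if (k >> shift) & 1 else low).keys.append(k)
        for i, c in enumerate(self.directory):
            if c is container:
                bit = (i >> (self.global_depth - depth)) & 1
                self.directory[i] = high if bit else low
```

Inserting the example sequence 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27 with this sketch doubles the container board four times (once for 8, once for 25, twice for 15), matching the walk-through; the later keys only split containers.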


Extendible Hashing

In principle very similar to direct hashing using the first bits of the key (h(k) = k / 2^x)

BUT: Within direct hashing the doubling of the table if an overflow occurs is much more expensive. For extendible hashing, each pointer should only be set to two successive addresses, for direct hashing each address should be split.


Example

[Figure: extendible hashing compared with direct hashing]

(There is no container board in direct hashing, but we added it here for the sake of understanding)


Analysis, extendible Hashing

Search has a constant cost, two I/O operations

Delete is combined if needed with the deletion of the container, but still constant cost

For insert “usually” max. 5 operations (search, write to the container, if needed write to other containers, write to the container board)

BUT IN ADDITION: If needed reorganization of the container board (duplicate all pointers)


Analysis, extendible Hashing

Doubling of the container board takes place mainly in main memory, so its cost is low in comparison to I/O operations

A very successful and widely used method


Excursus: Pseudo-random numbers

A topic closely related to hashing

Why “pseudo”-random numbers?
• The computer is a “good” computational servant
• Algorithms are always executed reliably in the same way
• Consequence: generating random numbers is not a strength of computers!

Applications
• Games
• Simulation
• Generating keys for cryptography

But especially also numerical solutions of problems


Example of an application

Computation of Pi

Area of the unit circle (Pi)

Compute the area of a quarter of the circle (Pi/4) numerically and then multiply by 4 to obtain Pi.

[Figure: quarter circle of radius 1 inside the unit square]


Compute Pi

Counting: 36 x 36 = 1296 small boxes

Or roll the dice!

[Figure: quarter circle on a 36 x 36 grid; the rows and columns are labeled with pairs of dice values from 11 to 66]
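Both approaches, counting the boxes and "rolling the dice", can be sketched briefly. The sketch makes two assumptions of its own: a box counts as inside the quarter circle if its center does, and the dice are replaced by Python's pseudo-random generator with a fixed seed. The function names are hypothetical.

```python
import random

def pi_by_counting(n=36):
    """Count the boxes of an n x n grid whose center lies in the quarter circle."""
    inside = 0
    for i in range(n):
        for j in range(n):
            x = (i + 0.5) / n
            y = (j + 0.5) / n
            if x * x + y * y <= 1.0:
                inside += 1
    return 4 * inside / (n * n)

def pi_by_random_points(samples=100_000, seed=1):
    """Draw pseudo-random points in the unit square and count the hits."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / samples

if __name__ == "__main__":
    print("counting 36 x 36 boxes:", pi_by_counting())
    print("random points:         ", pi_by_random_points())
```

In two dimensions the systematic count is cheap; the random variant becomes interesting when the number of dimensions makes a full grid infeasible.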


Compute Pi

Particularly for computations of high-dimensional cases (e.g., physical systems with many degrees of freedom, physics simulations, crash tests, …) it isn’t possible to go through all possible parameters systematically

The use of (good) multi-dimensional random numbers can lead to better results while using fewer values


Pseudo-random numbers

For this type of applications, pseudo-random numbers are even better than “real” random numbers

How does a normal pseudo-random generator work?
• It needs an initialization $z_0$
• A random function computes the next random number from the last one: $z_n = Z(z_{n-1})$

The requirements are similar to those for hash and collision resolution functions:
• Uniform distribution of the random numbers
• All random numbers (from a specific interval) should eventually appear once in the sequence


Example: Mid-square-generator

Was implemented e.g., in Apple II

$z_n = \text{middle\_digits}(z_{n-1}^2)$

Example: z0 = 42

42 x 42 = 1764; 76 x 76 = 5776 etc.

Sequence: 42 – 76 – 77 – 92 – 46 – 11 – 12 – 14 – 19 – 36 – 29 – 84 – 5 – 2 – 0 – 0 – 0 - …

Many sequences either end with “0” or repeat continuously (24 – 57 – 24 – 57 - …)

Very bad generator
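The example sequence can be reproduced with a few lines. The sketch assumes two-digit values whose square is padded to four digits, so that the "middle digits" are the hundreds and tens digits; the function names are hypothetical.

```python
def mid_square(z):
    """Next value: the two middle digits of the four-digit square of z."""
    return (z * z // 10) % 100

def sequence(z0, length):
    """Return the first `length` values of the generator started at z0."""
    values = [z0]
    for _ in range(length - 1):
        values.append(mid_square(values[-1]))
    return values

if __name__ == "__main__":
    print(sequence(42, 16))  # runs into 0 and stays there
    print(sequence(24, 6))   # short cycle 24, 57, 24, 57, ...
```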


Linear congruential generator

Better: linear congruential generator

Appears to be familiar to us

$z_n = (z_{n-1} \cdot a + b) \bmod m$

Example: $z_n = (z_{n-1} \cdot 21 + 17) \bmod 40$

… generates an optimal sequence …

1 - 38 - 15 - 12 - 29 - 26 - 3 - 0 - 17 - 14 - 31 - 28 - 5 - 2 - 19 - 16 - 33 - 30 - 7 - 4 - 21 - 18 - 35 - 32 - 9 - 6 - 23 - 20 - 37 - 34 - 11 - 8 - 25 - 22 - 39 - 36 - 13 - 10 - 27 - 24 - 1
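That the example parameters really visit every value from 0 to 39 exactly once before repeating can be checked with a short sketch. The function names are hypothetical; the parameters a = 21, b = 17, m = 40 are the ones from the example.

```python
def lcg(z, a=21, b=17, m=40):
    """One step of the linear congruential generator from the example."""
    return (z * a + b) % m

def cycle_from(z0, a=21, b=17, m=40):
    """Follow the generator from z0 until a value repeats; return the values seen."""
    seen = []
    z = z0
    while z not in seen:
        seen.append(z)
        z = lcg(z, a, b, m)
    return seen

if __name__ == "__main__":
    values = cycle_from(1)
    print(len(values), "different values before the sequence repeats")
    print(" - ".join(str(v) for v in values))
```

A full cycle only shows that every value appears once; it says nothing about how well consecutive values are distributed in several dimensions.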



$z_n = (z_{n-1} \cdot a + b) \bmod m$

The parameters a, b, m determine the quality

As in hashing: it is reasonably easy to define the minimal requirements for good quality, e.g., a and m coprime

But: uniform distribution in several dimensions is hard

Example: 2, 7, 4, 9, 6, 1, 8, 3, 0, 5, …

One dimension: uniformly distributed

Two dimensions: (2, 7), (4, 9), (6, 1), (8, 3), (0, 5) are located on two “lines”, so they are not uniformly distributed



Separate research area in computer science and mathematics which is focused on finding good pseudo-random generators

For numerical applications pseudo-random numbers are often better than real random numbers

For cryptography this no longer applies: there are plug-in cards which generate real random numbers based on quantum physics …


Thoughts: Hash / Random

Often the computer produces apparent chaos

The computer cannot really do this: on closer inspection it is always just another kind of order

The “chaotic” arrangement of data in hash tables and pseudo-random generators are good examples of this
