
Copyright © 2003-2005 Curt Hill

Hashing

A quick lookup strategy


What is a hash?

• Hashing is another name for key transformation

• The original key is usually a character string or other sparse key

• The result is usually a dense integer key


Components

• Every hash scheme has three pieces
  – A hash function
  – A collection of buckets
  – A collision scheme

• The hash function transforms the key into an integer in the range 0 – (N-1)

• The buckets are numbered 0 – (N-1) and hold the resulting data


Basics

• The idea:
  – Have N containers for the data
  – Have a function that maps the original key to a number in the range 0 – (N-1)
  – One access retrieves the data regardless of the table's size
  – When it works it is the fastest way to look up a value


Hash function

• Convert a sparse key into a dense key in the form of an integer

• The input is the real key, which is a sparse key
  – Sparse keys use very few of very many possible key values

• A typical key is a character string, while the result needs to be an integer


Typical hash function

• Input is a character string

• Output is an integer in the range 0 – (N-1)

• Sum the ordinal value of each character of the string

• Divide the result by N and take the remainder
  – This guarantees the right range

• Number theory tells us to make N prime


Example values with N = 256

• “Abcdef” returns 53

• “Hi there” returns 233

• “ABCDEF” returns 149

• “FEDCBA” returns 149

• “A character string” returns 197

• “A big character string” returns 23

• “Zoology” returns 243

• See the pattern yet?
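The sum-and-remainder function described on the previous slide can be sketched in a few lines of Python. This is a minimal sketch, assuming the ordinal value of a character is its code point (identical to ASCII for these strings); the name `ordinal_sum_hash` is made up for illustration. With N = 256 it gives the values listed above.

```python
def ordinal_sum_hash(key: str, n: int) -> int:
    """Sum the ordinal value of every character, then take the remainder mod n."""
    total = sum(ord(ch) for ch in key)
    return total % n  # the remainder always falls in the range 0 .. n-1


if __name__ == "__main__":
    for text in ["Abcdef", "Hi there", "ABCDEF", "FEDCBA", "Zoology"]:
        print(text, ordinal_sum_hash(text, 256))
```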


Hash functions

• Computing the function is easy

• Two problems
  – One result produced by two different keys, called a collision
  – The order of the original keys is scrambled in the result

• Very good for an equality test and very bad for a range test


Is it that simple?

• Not usually

• The previous scheme maps all words with exactly the same characters to the same integer value

• Many variations
  – Do a different computation on even and odd characters so a rearrangement will produce different values
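One possible variation of this kind is sketched below. It is only an illustration, not necessarily the scheme the slides have in mind: counting the characters at odd positions twice, and the name `position_hash`, are assumptions. Swapping two characters that both sit at even positions (or both at odd positions) still collides, so this only reduces the problem.

```python
def position_hash(key: str, n: int) -> int:
    """Ordinal sum that treats even and odd character positions differently."""
    total = 0
    for index, ch in enumerate(key):
        if index % 2 == 0:
            total += ord(ch)        # even positions: add the ordinal value once
        else:
            total += 2 * ord(ch)    # odd positions: add it twice (assumed weighting)
    return total % n


if __name__ == "__main__":
    # The two strings collide under the plain sum but not here.
    print(position_hash("ABCDEF", 256), position_hash("FEDCBA", 256))
```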


Collisions

• Two keys giving the same integer
  – Assume each bucket may only hold one record

• Collisions prevent hashing from being the optimal search technique it could be

• Almost every hashing scheme needs a collision strategy

• There have been many variations


Collision Strategies

• Linear probing
  – Add one to the index until a free bucket (or the wanted key) is found

• Quadratic probing
  – After a collision, step away from the home index by the square of the probe count (1, 4, 9, ...)

• Rehash
  – Apply a secondary hash function

• Chaining
  – Link colliding records in an overflow area


Collision Strategies

[Figure: three tables of eight buckets (0 – 7), labeled Linear, Quadratic and Chaining, each storing the same six records, shown as hash value and key: 1 d, 2 e, 2 t, 5 c, 6 m, 6 y]
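The linear, quadratic and chaining strategies can be sketched as follows. This is a rough sketch rather than a definitive implementation: the records are the (hash value, key) pairs that appear in the diagram, but the insertion order, the eight-bucket table size and all function names are assumptions. The rehash strategy is left out.

```python
def linear_slot(home: int, attempt: int, size: int) -> int:
    """Linear probing: step one bucket further on every attempt."""
    return (home + attempt) % size


def quadratic_slot(home: int, attempt: int, size: int) -> int:
    """Quadratic probing: step attempt**2 buckets away from the home bucket."""
    return (home + attempt * attempt) % size


def insert_open(table: list, home: int, key: str, slot_for) -> int:
    """Place key in the first free bucket of its probe sequence; return the index."""
    size = len(table)
    for attempt in range(size):
        slot = slot_for(home, attempt, size)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("probe sequence found no free bucket")


def insert_chained(table: list, home: int, key: str) -> None:
    """Chaining: every bucket holds a list, so colliding keys share a bucket."""
    table[home].append(key)


# (hash value, key) pairs from the diagram; the insertion order is assumed.
RECORDS = [(1, "d"), (2, "e"), (2, "t"), (5, "c"), (6, "m"), (6, "y")]

linear = [None] * 8
quadratic = [None] * 8
chained = [[] for _ in range(8)]
for home, key in RECORDS:
    insert_open(linear, home, key, linear_slot)
    insert_open(quadratic, home, key, quadratic_slot)
    insert_chained(chained, home, key)
```

With this data and order, both probing schemes resolve each collision on the first extra probe, so the two open tables come out the same. They diverge from the second extra probe on: starting at home bucket 2, linear probing visits 2, 3, 4, 5 while quadratic probing visits 2, 3, 6, 3.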


Free Space

• As the free space decreases the collisions increase

• The best plan is to have about twice as many buckets as will be needed


Effect of load using linear probing

Load factor    Number of probes
10%            1.06
25%            1.17
50%            1.5
75%            2.5
90%            5.5
95%            10.5
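These figures agree with the classical approximation for the average number of probes in a successful search under linear probing, (1 + 1/(1 - a)) / 2, where a is the load factor. The slides do not say which formula was used, so treat the sketch below as one plausible way to recompute the table; the function name is made up for illustration.

```python
def expected_probes(load_factor: float) -> float:
    """Approximate probes for a successful search with linear probing."""
    return (1 + 1 / (1 - load_factor)) / 2


if __name__ == "__main__":
    for load in (0.10, 0.25, 0.50, 0.75, 0.90, 0.95):
        print(f"{load:.0%}  {expected_probes(load):.2f}")
```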


Dynamics

• A hash table is much harder to grow or shrink than a tree

• Increasing or decreasing the number of buckets requires:
  – Allocating a new table (larger or smaller)
  – Rehashing every item from the old table into the new one

• Deleting a record (without resizing) is also a problem because of collisions
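The allocate-and-rehash step might look like the sketch below. It assumes a chained table (a list of lists) and the ordinal-sum hash function; both choices and all names are illustrative only.

```python
def ordinal_sum_hash(key: str, n: int) -> int:
    """Sum of character ordinals, reduced to the range 0 .. n-1."""
    return sum(ord(ch) for ch in key) % n


def resize(old_table: list, new_size: int) -> list:
    """Allocate a new chained table and rehash every key from the old one."""
    new_table = [[] for _ in range(new_size)]
    for bucket in old_table:
        for key in bucket:
            # A key's bucket depends on the table size, so it must be recomputed.
            new_table[ordinal_sum_hash(key, new_size)].append(key)
    return new_table


# Three keys that all share bucket 3 in a four-bucket table.
small = [[] for _ in range(4)]
for key in ("ab", "cd", "ef"):
    small[ordinal_sum_hash(key, 4)].append(key)

bigger = resize(small, 7)  # the same keys now occupy three separate buckets
```

Every key has to be visited because its bucket number is a function of the table size, which is why a resize costs time proportional to the number of stored records.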


Deletion of a key

• The problem is collisions

• If the key has no collisions just mark the bucket as empty

• If it has collisions
  – All of them need to be rehashed
  – Finding them depends on the collision strategy
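For linear probing, the keys that may need rehashing are the ones stored in the run of occupied buckets right after the deleted key. The sketch below assumes linear probing, the ordinal-sum hash and one key per bucket; the function names are made up for illustration.

```python
def home(key: str, size: int) -> int:
    """Home bucket: sum of character ordinals modulo the table size."""
    return sum(ord(ch) for ch in key) % size


def insert(table: list, key: str) -> int:
    """Store key in the first free bucket at or after its home bucket."""
    size = len(table)
    slot = home(key, size)
    for _ in range(size):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
        slot = (slot + 1) % size
    raise RuntimeError("table is full")


def find(table: list, key: str):
    """Return the bucket index holding key, or None if it is absent."""
    size = len(table)
    slot = home(key, size)
    for _ in range(size):
        if table[slot] is None:
            return None          # an empty bucket ends the probe sequence
        if table[slot] == key:
            return slot
        slot = (slot + 1) % size
    return None


def delete(table: list, key: str) -> bool:
    """Empty the key's bucket, then rehash the rest of its cluster."""
    slot = find(table, key)
    if slot is None:
        return False
    table[slot] = None
    size = len(table)
    nxt = (slot + 1) % size
    while table[nxt] is not None:
        moved = table[nxt]
        table[nxt] = None        # take the key out ...
        insert(table, moved)     # ... and put it back where it now belongs
        nxt = (nxt + 1) % size
    return True
```

Without the rehash loop, a key that had been pushed past the deleted bucket would become unreachable, because `find` stops at the first empty bucket.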


Two more terms

• A perfect hash is one that has no collisions
  – Only occurs in situations with two conditions:
    – Keys are fixed and known in advance
    – The hash function is tailored to these keys

• A minimal hash has no empty buckets
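When the keys are fixed and known in advance, one simple way to tailor the hash is to search for a table size at which the ordinal-sum function produces no collisions. This is only a sketch under that assumption: the names and the `max_size` limit are hypothetical, and practical perfect-hash generators use more elaborate methods.

```python
def ordinal_sum_hash(key: str, n: int) -> int:
    """Sum of character ordinals, reduced to the range 0 .. n-1."""
    return sum(ord(ch) for ch in key) % n


def find_collision_free_size(keys: list, max_size: int = 1000):
    """Return the smallest table size with no collisions, or None if none is found."""
    for n in range(len(keys), max_size + 1):
        used = {ordinal_sum_hash(key, n) for key in keys}
        if len(used) == len(keys):   # every key got its own bucket
            return n
    return None
```

If the returned size equals the number of keys, the hash is also minimal for that key set. Keys made of the same characters, such as "ab" and "ba", can never be separated this way, so the search returns None for them.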


Data Skew

• A good hash function spreads the keys uniformly among the integer range

• How well does this work when the data is not uniformly distributed?

• For example consider
  – Names: there are many more Smiths than Garnjobsts
  – Numbers: there are many more 101 courses than most other numbers

• Can the hash function still give a good distribution?
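Equal keys always hash to the same bucket, so no hash function can spread out repeated values. The sketch below counts how many records land in each bucket for a skewed list of names; the sample list, the eight-bucket table and the helper names are made up for illustration.

```python
from collections import Counter


def ordinal_sum_hash(key: str, n: int) -> int:
    """Sum of character ordinals, reduced to the range 0 .. n-1."""
    return sum(ord(ch) for ch in key) % n


def bucket_loads(keys: list, n: int) -> Counter:
    """Count how many keys fall into each of the n buckets."""
    return Counter(ordinal_sum_hash(key, n) for key in keys)


names = ["Smith"] * 5 + ["Garnjobst"]
loads = bucket_loads(names, 8)   # all five Smiths share a single bucket
```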


Summary

• Hashing works best when it has
  – A stable index
  – A good distribution from the hash function

• A hash table is much harder to make dynamic than a tree
