Copyright © 2003-2005Curt Hill Hashing A quick lookup strategy

Hashing

A quick lookup strategy

What is a hash?

• Hashing is another name for key transformation

• The original key is usually a character string or other sparse key

• The result is usually a dense integer key

Components

• Every hash scheme has three pieces
  – A hash function
  – A collection of buckets
  – A collision scheme

• The hash function transforms the key into an integer in the range 0 to N-1

• The N buckets are numbered 0 to N-1 and hold the resulting data

Basics

• The idea:
  – Have N containers for the data
  – Have a function that maps the original key to a number in the range 0 to N-1
  – One access retrieves the data regardless of the table's size
  – When it works, it is the fastest way to look up a value

Hash function

• Convert a sparse key into a dense key in the form of an integer

• The input is the real key, which is a sparse key
  – Sparse keys use very few of very many possible key values

• A typical key is a character string, while the result must be an integer

Typical hash function

• Input is a character string

• Output is an integer in the range 0 to N-1

• Sum the ordinal value of each character of the string

• Divide the result by N and take the remainder
  – This guarantees the right range

• Number theory tells us to make N prime

Example values with N = 256

• “Abcdef” returns 53
• “Hi there” returns 233
• “ABCDEF” returns 149
• “FEDCBA” returns 149
• “A character string” returns 197
• “A big character string” returns 23
• “Zoology” returns 243
• See the pattern yet?
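The example values can be checked with a short sketch of the sum-and-remainder function described on the previous slide. It assumes that "ordinal value" means the ASCII code of each character; the name `simple_hash` is made up for this illustration.

```python
def simple_hash(key: str, n: int) -> int:
    """Sum the ordinal value of each character, then take the remainder mod n."""
    return sum(ord(ch) for ch in key) % n


# Assumption: plain ASCII text, N = 256 as in the example slide.
for text in ["Abcdef", "Hi there", "ABCDEF", "FEDCBA",
             "A character string", "A big character string", "Zoology"]:
    print(f"{text!r} -> {simple_hash(text, 256)}")
```

Under that assumption the sketch gives the values listed above, including the shared result 149 for “ABCDEF” and “FEDCBA”, which contain the same characters in a different order.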

Hash functions

• Computing the function is easy

• Two problems
  – One result can be produced by two different keys, which is called a collision
  – The order of the original keys is scrambled in the result

• Very good for an equality test and very bad for a range test

Is it that simple?

• Not usually

• The previous scheme maps all words with exactly the same characters to the same integer value

• Many variations exist
  – Do a different computation on even and odd characters so that a rearrangement will produce different values
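One possible reading of the even/odd variation is sketched below. The slide does not say which computations to use, so the choice here (add characters at even positions as they are, multiply characters at odd positions by a weight of 3) is purely an assumption, as are the names `even_odd_hash` and `odd_weight`.

```python
def even_odd_hash(key: str, n: int, odd_weight: int = 3) -> int:
    """Hypothetical variant: weight characters at odd positions differently."""
    total = 0
    for position, ch in enumerate(key):
        if position % 2 == 0:
            total += ord(ch)               # even position: plain ordinal value
        else:
            total += odd_weight * ord(ch)  # odd position: weighted ordinal value
    return total % n


print(even_odd_hash("ABCDEF", 256))  # 45
print(even_odd_hash("FEDCBA", 256))  # 39
```

With this weighting “ABCDEF” and “FEDCBA” no longer collide. The fix is only partial: swapping two characters that both sit at even positions (or both at odd positions) still gives the same value, so “ABC” and “CBA” collide.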

Collisions

• Two keys giving the same integer
  – Assume each bucket may hold only one record

• Collisions prevent hashing from being the optimal search technique it could be

• Almost every hashing scheme needs a collision strategy

• There have been many variations

Collision Strategies

• Linear probing
  – Add one to the index until the right bucket is found

• Quadratic probing
  – Add successive squares (1, 4, 9, ...) to the original index when a collision occurs

• Rehash
  – Apply a secondary hash function

• Chaining
  – Link duplicates (colliding keys) in an overflow area
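The two probing strategies can be sketched with a shared probe sequence. This is a minimal illustration, not a full table implementation: the home bucket is passed in directly instead of being computed by a hash function, quadratic probing is taken to mean "home index plus i squared", and all function names are invented for the example.

```python
from typing import Iterator, List, Optional


def probe_sequence(home: int, n: int, quadratic: bool = False) -> Iterator[int]:
    """Yield the buckets to try, starting at the home bucket."""
    for i in range(n):
        offset = i * i if quadratic else i  # 0, 1, 4, 9, ... or 0, 1, 2, 3, ...
        yield (home + offset) % n


def insert(table: List[Optional[str]], key: str, home: int,
           quadratic: bool = False) -> int:
    """Store key in the first free bucket on its probe sequence."""
    for slot in probe_sequence(home, len(table), quadratic):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("no free bucket on the probe sequence")


def lookup(table: List[Optional[str]], key: str, home: int,
           quadratic: bool = False) -> Optional[int]:
    """Follow the same probe sequence; an empty bucket means the key is absent."""
    for slot in probe_sequence(home, len(table), quadratic):
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None


table: List[Optional[str]] = [None] * 8
insert(table, "e", 2)
insert(table, "t", 2)   # collides with "e", moves on to the next bucket
print(table)
```

Lookup must use the same probe sequence as insertion, which is why both functions take the same `quadratic` flag.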

Collision Strategies

[Figure: three eight-bucket tables (buckets 0 to 7), labeled Linear, Quadratic, and Chaining, each storing the keys d (hash value 1), e (2), c (5), m (6), t (2), and y (6). The keys t and y collide with e and m.]
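The chaining panel of the figure can be approximated in a few lines. The sketch below uses one list per bucket instead of a separate overflow area, which is a simplification, and the insertion order of the keys is assumed because the figure does not state it.

```python
# Hash values for the figure's keys; the insertion order is an assumption.
HOME_BUCKET = {"d": 1, "e": 2, "c": 5, "m": 6, "t": 2, "y": 6}

buckets = [[] for _ in range(8)]        # one chain (list) per bucket
for key, home in HOME_BUCKET.items():
    buckets[home].append(key)           # colliding keys share a chain


def find(key: str) -> bool:
    """Only the chain of the key's home bucket has to be searched."""
    return key in buckets[HOME_BUCKET[key]]


for number, chain in enumerate(buckets):
    print(number, chain)
```

Buckets 2 and 6 end up holding two keys each (e with t, m with y), while no key is ever pushed out of its home bucket.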

Free Space

• As the free space decreases the collisions increase

• The best plan is to have about twice as many buckets as will be needed, which keeps the table about half full

Effect of load using linear probing

Load factor    Number of probes
10%            1.06
25%            1.17
50%            1.5
75%            2.5
90%            5.5
95%            10.5
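The slide does not say where these numbers come from, but they agree with the classic approximation for a successful search under linear probing, (1 + 1/(1 - a)) / 2, where a is the load factor. The sketch below evaluates that formula; treating it as the source of the table is an assumption.

```python
def expected_probes(load_factor: float) -> float:
    """Approximate probes for a successful search with linear probing."""
    return 0.5 * (1.0 + 1.0 / (1.0 - load_factor))


for percent in (10, 25, 50, 75, 90, 95):
    print(f"{percent}%  {expected_probes(percent / 100):.2f}")
```

The 1/(1 - a) term explains the shape of the table: the cost barely moves up to a half-full table and then climbs steeply as the free space runs out.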

Dynamics

• A hash table is much harder to grow or shrink than a tree

• Increasing or decreasing the number of buckets requires:
  – Allocating a new table (larger or smaller)
  – Rehashing every item in the old table into the new one

• Deleting a record (without resizing) is also a problem because of collisions
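The two resizing steps can be sketched as follows. The example assumes a chained table (one list per bucket) and the sum-of-ordinals hash from earlier in the deck; `bucket_index` and `resize` are names chosen for the illustration.

```python
from typing import List


def bucket_index(key: str, n: int) -> int:
    """Sum-of-ordinals hash; the result depends on the table size n."""
    return sum(ord(ch) for ch in key) % n


def resize(old_buckets: List[List[str]], new_size: int) -> List[List[str]]:
    """Allocate a new table, then rehash every item from the old one into it."""
    new_buckets: List[List[str]] = [[] for _ in range(new_size)]
    for chain in old_buckets:
        for key in chain:
            new_buckets[bucket_index(key, new_size)].append(key)
    return new_buckets


old: List[List[str]] = [[] for _ in range(4)]
for word in ["hash", "table", "bucket", "probe", "chain"]:
    old[bucket_index(word, 4)].append(word)

new = resize(old, 11)   # every key is touched once, so the cost grows with the data
print(new)
```

Items cannot simply be copied bucket by bucket, because the bucket number is computed from the table size and changes when the size does.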

Deletion of a key

• The problem is collisions

• If the key has no collisions, just mark the bucket as empty

• If it has collisions
  – All of them need to be rehashed
  – Finding them depends on the collision strategy
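For linear probing, the colliding keys are the ones stored in the run of occupied buckets right after the deleted one. The sketch below empties the bucket and then rehashes that run. It assumes linear probing and the sum-of-ordinals hash; the function names are illustrative only.

```python
from typing import List, Optional


def home(key: str, n: int) -> int:
    return sum(ord(ch) for ch in key) % n


def insert(table: List[Optional[str]], key: str) -> int:
    n = len(table)
    slot = home(key, n)
    for _ in range(n):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
        slot = (slot + 1) % n          # linear probing: try the next bucket
    raise RuntimeError("table is full")


def find(table: List[Optional[str]], key: str) -> Optional[int]:
    n = len(table)
    slot = home(key, n)
    for _ in range(n):
        if table[slot] is None:
            return None                # an empty bucket ends the search
        if table[slot] == key:
            return slot
        slot = (slot + 1) % n
    return None


def delete(table: List[Optional[str]], key: str) -> bool:
    slot = find(table, key)
    if slot is None:
        return False
    n = len(table)
    table[slot] = None                 # mark the bucket as empty
    nxt = (slot + 1) % n
    while table[nxt] is not None:      # rehash the rest of the run
        moved = table[nxt]
        table[nxt] = None
        insert(table, moved)
        nxt = (nxt + 1) % n
    return True


table: List[Optional[str]] = [None] * 8
for word in ["abc", "cab", "bca"]:     # same letters, so all three hash to bucket 6
    insert(table, word)
delete(table, "abc")
print(table, find(table, "bca"))
```

Without the rehashing loop, the empty bucket left by “abc” would end the search for “cab” and “bca” early and make both keys appear to be missing.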

Two more terms

• A perfect hash is one that has no collisions
  – Only occurs in situations with two conditions:
    – Keys are fixed and known in advance
    – The hash function is tailored to these keys

• A minimal hash has no empty buckets
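A toy example shows both terms at once. The key set and the hash function below are invented for the illustration: for the three fixed keys "red", "green", and "blue", the key length modulo 3 happens to be both perfect and minimal.

```python
KEYS = ["red", "green", "blue"]        # fixed and known in advance (hypothetical)


def tailored_hash(key: str) -> int:
    """Tailored to KEYS only: the lengths 3, 5, 4 give the remainders 0, 2, 1."""
    return len(key) % 3


table = [None] * 3
for key in KEYS:
    slot = tailored_hash(key)
    assert table[slot] is None         # perfect: no two keys share a bucket
    table[slot] = key

print(table)                           # minimal: no bucket is left empty
```

The function is useless for any other key set, for example one that also contains "pink", which is why a perfect hash requires the keys to be fixed beforehand.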

Data Skew

• A good hash function spreads the keys uniformly among the integer range

• How well does this work when the data is not uniformly distributed?

• For example, consider
  – Names: there are many more Smiths than Garnjobsts
  – Numbers: there are many more 101 courses than most other numbers

• Can the hash function still give a good distribution?
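One way to explore the question is to count how many records land in each bucket. The sketch below uses a made-up list of names and the sum-of-ordinals hash with N = 7; both are assumptions chosen only to make the skew visible.

```python
from collections import Counter


def simple_hash(key: str, n: int) -> int:
    return sum(ord(ch) for ch in key) % n


# Hypothetical, deliberately skewed data: many Smiths, one Garnjobst.
names = ["Smith"] * 5 + ["Jones"] * 2 + ["Garnjobst"]

occupancy = Counter(simple_hash(name, 7) for name in names)
for bucket, count in sorted(occupancy.items()):
    print(bucket, count)
```

Whatever function is used, identical keys always hash to the same bucket, so the five Smith records share one bucket here. A hash function can spread distinct keys evenly, but it cannot undo skew caused by repeated keys.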

Summary

• Hashing works best when it has
  – A stable index
  – A good distribution from the hash function

• A hash table is much harder to make dynamic than a tree