
Hash Tables

If keys are not numbers

What do we do in situations where the key is no longer an index that can be used directly, as in array indexing?

e.g., the keys are names.

• We can compute the index based on the key; that is, we set up a one-to-one correspondence between the keys (by which we wish to retrieve information) and indices (that we can use to access an array).

Keys are two letter names

• Now let's consider the case of a table containing information for students using their last names as keys.

• To make the example simple, we will pretend that the students enrolled here all have unique last names consisting of exactly two letters.

• That is, the names go from AA to ZZ inclusive.

• Then Mr. AA should have an index of 0 in the table.

• Mr. ZZ should have an index of 26^2 – 1 (= 675) in the table.

• There can be a maximum of 26^2 (= 676) students with two-letter names.

What is our index function?

if A=0 and Z=25:

i = 26^1 * last_name[0] + 26^0 * last_name[1]

or,

i = 26^1 * last_name[1] + 26^0 * last_name[0]

if A=65 and Z=90 (in C++, chars are just 8-bit integers holding ASCII values):

i = 26^1 * (last_name[0] - 'A') + 26^0 * (last_name[1] - 'A')
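The ASCII form of the index function can be sketched in C++ as follows. This is only an illustration: the function name two_letter_index is made up here, and it assumes the name consists of exactly two uppercase letters.

```cpp
#include <cassert>
#include <string>

// Maps a two-letter uppercase name ("AA" .. "ZZ") to an index in 0 .. 675.
// Hypothetical helper: the first letter is treated as the high-order digit
// of a base-26 number, matching i = 26^1*(name[0]-'A') + 26^0*(name[1]-'A').
int two_letter_index(const std::string& last_name) {
    assert(last_name.size() == 2);
    assert(last_name[0] >= 'A' && last_name[0] <= 'Z');
    assert(last_name[1] >= 'A' && last_name[1] <= 'Z');
    return 26 * (last_name[0] - 'A') + (last_name[1] - 'A');
}
```

With this sketch, "AA" maps to 0 and "ZZ" maps to 675, so every two-letter name gets its own slot in a 676-entry array.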

Keys are real names

• Of course the example above was not particularly realistic, since the keys that we transformed into indices were supposed to represent people's names, and restricting names to two letters is not reasonable.

• We will reconsider this example in the context of restricting names to be precisely ten characters long. For now we will ignore the possibility that names might be shorter.

• We will develop a new index function for this situation and will arrive at a new problem related to the size of the resulting table.

• It will be far too large to be accommodated.

• Despite its enormous size, it will contain only a small fraction of actual data.

• As a result we describe the table as sparse. It consists largely of empty space.

index function for real names (each letter taken as a value from 0 to 25, as before):

i = 26^0 * last_name[0] + 26^1 * last_name[1] + 26^2 * last_name[2] + ... + 26^9 * last_name[9]

• But consider the number of possible keys! 26^10, or about 1.4 x 10^14, billions of times greater than the population of the AUC!!

• If we set up a table using this index function, the actual number of students in it would be very small by comparison to the number of possible keys. As a result the table is said to be a sparse table, and consists mostly of wasted space.
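To get a feel for the size of this table, the ten-letter index function can be sketched in C++. The names possible_keys and ten_letter_index are invented for this illustration, and uppercase ASCII letters are assumed; a 64-bit integer is needed because the indices no longer fit in 32 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Number of possible ten-letter keys: 26^10.
std::uint64_t possible_keys() {
    std::uint64_t n = 1;
    for (int k = 0; k < 10; ++k) n *= 26;
    return n;
}

// Index function from the slide: the sum over k of 26^k * (last_name[k] - 'A').
// Hypothetical helper; assumes exactly ten uppercase letters.
std::uint64_t ten_letter_index(const std::string& last_name) {
    assert(last_name.size() == 10);
    std::uint64_t index = 0;
    std::uint64_t weight = 1;  // 26^k for the current position k
    for (char c : last_name) {
        assert(c >= 'A' && c <= 'Z');
        index += weight * static_cast<std::uint64_t>(c - 'A');
        weight *= 26;
    }
    return index;
}
```

Here possible_keys() evaluates to 141,167,095,653,376, so a table with one slot per possible key would need over a hundred trillion entries to hold a few thousand students.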

sparse table

• A sparse table is a table indexed by a very large set, but with relatively few positions actually occupied.

The point?

• Situations arise where the number of possible keys far outweighs the number of actual keys.

• Our tables must be able to accommodate all possible keys in order for our index function to work.

• In such a situation it is not practical to use an index function which produces a unique index for every possible key.

Conclusion:

• An ordinary table is not an appropriate data structure for this situation, since the table would be too large to be practical.

• Even for a smaller problem, it would be very wasteful because of its sparsity.

• Therefore an ordinary table is not a good solution for storing real names.

How do we deal with such situations?

Review:

• Why do we want an index function in the first place?

• Why not store students' data in a list?

• Then we have to search through the list to locate the data.

• Problem: such searching is O(n) with a sequential search, and O(lg n) at best with a binary search on a sorted list (depending on the algorithm).

• An indexed table allows us to retrieve a student's record in O(1) time.

Summary:

– We want a fully indexed table for efficiency.

– We can't afford an ordinary table because of wasted space.

How do we deal with such situations?

We want to achieve the benefits of O(1) access to entries in a table without having a full or complete table.

The solution to our problem is to use a hash table.

hash table

• In a hash table, we use a hash function (index function) where the index is restricted to fall within a limited range much smaller than the possible number of keys.

• This “limited range” will simply be the size of the hash table.

• In our case it will be the expected total number of students (roughly 3000, not 26^10, or about 1.4 x 10^14).

Hash Tables

• A hash table uses an index function that does not produce unique indices.

• In reality, hash functions generally fail to produce unique indices. This is not something that we do intentionally, but rather is an unfortunate side effect of squeezing many possible keys into a small table.

Implementing Hash Table

• We begin with a hash function that takes a key and maps it to some index in the array. This function will generally map several different keys to the same index. If the desired record is in the location given by the index, then our problem is solved;

• otherwise we must use some method to resolve the collision that may have occurred between two records wanting to go to the same location. There are thus two questions we must answer to use hashing:

– First, we must find good hash functions.

– Second, we must determine how to resolve collisions.

• Before approaching these questions, let us pause to outline informally the steps needed to implement hashing.

Algorithm Outlines for Hash Table

initialization

• First, an array must be declared that will hold the hash table. Next, all locations in the array must be initialized to show that they are empty.

insertion

• To insert a record into the hash table, the hash function of its key is first calculated. If the corresponding location is empty, then the record can be inserted, else if the keys are equal, then insertion of the new record would not be allowed, and in the remaining case (a record with a different key is in the location), it becomes necessary to resolve the collision.

retrieval

• Retrieving the record with a given key is entirely similar. First, the hash function for the key is computed. If the desired record is in the corresponding location, then the retrieval has succeeded; otherwise, while the location is nonempty and not all locations have been examined, follow the same steps used for collision resolution. If an empty position is found, or all locations have been considered, then no record with the given key is in the table, and the search is unsuccessful.
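The three outlines above (initialization, insertion, retrieval) can be sketched in C++ as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the Record and HashTable types are invented for illustration, std::hash stands in for the hash function, and collisions are resolved by stepping to the next slot (linear probing), which is only one of several possible strategies. It needs C++17 for std::optional.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Illustrative record type (hypothetical).
struct Record {
    std::string key;
    int value;
};

class HashTable {
public:
    // Initialization: every slot starts out empty (std::nullopt).
    explicit HashTable(std::size_t size) : slots_(size) { assert(size > 0); }

    // Insertion: returns false if the key is already present or the table is full.
    bool insert(const Record& rec) {
        const std::size_t start = home(rec.key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const std::size_t pos = (start + i) % slots_.size();  // linear probing (assumed)
            if (!slots_[pos]) {          // empty location: insert here
                slots_[pos] = rec;
                return true;
            }
            if (slots_[pos]->key == rec.key) return false;  // equal keys: not allowed
            // Otherwise a record with a different key is here: try the next slot.
        }
        return false;  // all locations examined, no room left
    }

    // Retrieval: returns a pointer to the record, or nullptr if it is not in the table.
    const Record* retrieve(const std::string& key) const {
        const std::size_t start = home(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const std::size_t pos = (start + i) % slots_.size();
            if (!slots_[pos]) return nullptr;            // empty position: unsuccessful
            if (slots_[pos]->key == key) return &*slots_[pos];
        }
        return nullptr;  // all locations considered: unsuccessful
    }

private:
    // Hash function: std::hash is a stand-in, reduced to the table's index range.
    std::size_t home(const std::string& key) const {
        return std::hash<std::string>{}(key) % slots_.size();
    }

    std::vector<std::optional<Record>> slots_;
};
```

Note that retrieval may stop at the first empty slot only because this sketch never deletes records; supporting deletion would need extra bookkeeping.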

Hash Function

• The index function used in a hash table is known as a “hash function.”

• This is where the name “hash table” originates from.

• The hash function effectively hashes, or “chops” the key producing from it a value that is no longer recognisable from the original key.

• The most important distinction between an ordinary index function and a hash function is that unlike a regular index function, a hash function does not yield a one-to-one correspondence between indices and keys.

• Rather, the number of indices produced by a hash function is much smaller than the possible number of keys.

• In other words, there is an n-to-m relationship between indices and keys where n<<m.

• We will now discuss a number of strategies used by hash functions to achieve this.

Choosing a Hash Function

The two principal criteria in selecting a hash function are as follows:

• A hash function should be easy and quick to compute.

• A hash function should achieve an even distribution of the keys that actually occur across the range of indices.


• The usual way to make a hash function is to take the key, chop it up, mix the pieces together in various ways, and thereby obtain an index that will be uniformly distributed over the range of indices.

• Note that there is nothing random about a hash function. If the function is evaluated more than once on the same key, then it must give the same result every time, so the key can be retrieved without fail.


• Truncation: Sometimes we ignore part of the key, and use the remaining part as the index.

• Folding: We may partition the key into several parts and combine the parts in a convenient way.

• Modular arithmetic: We may convert the key to an integer, divide by the size of the index range, and take the remainder as the result.

A better spread of keys is often obtained by taking the size of the table (the index range) to be a prime number.

truncation:

key      index
1734     34
31952    52
25       25
972      72

• this method might be used if we expect only 100 (00 ~ 99) actual keys from all the possible keys

• truncation from left - could have been right ... or even middle

• advantages: fast

• disadvantages: often fails to distribute the keys evenly in the table.
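A minimal C++ sketch of the truncation example, assuming non-negative integer keys and a table of 100 slots. The name truncate_hash is made up for this illustration; the digits are dropped from the left by keeping the remainder after division by 100.

```cpp
#include <cassert>

// Truncation: ignore the leading digits of the key and keep only the
// last two decimal digits as the index (0 .. 99).
// Hypothetical helper; assumes a non-negative key.
int truncate_hash(int key) {
    assert(key >= 0);
    return key % 100;
}
```

For the keys in the table above this gives 34, 52, 25 and 72.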

folding:

divide the key into a number of parts and combine with some mathematical function +, *, etc.

example: 254-1072 phone number

take all the digits of the exchange code (254, 949, 945, 759, 253, etc.) and add to the remaining part of the number:

# 254-1072    254 + 1072 = 1326

# 945-1425    945 + 1425 = 2370

• disadvantage: a bit slower than truncation

• advantage: better spread of keys than truncation.
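The phone number example can be sketched in C++ like this. It assumes the seven-digit number is stored as a single integer (254-1072 as 2541072); the name fold_hash is invented for the illustration.

```cpp
#include <cassert>

// Folding: split a seven-digit phone number into its three-digit exchange
// code and the remaining four digits, then add the two parts.
// Hypothetical helper; assumes the number is stored as one integer.
int fold_hash(int phone_number) {
    assert(phone_number >= 0 && phone_number <= 9999999);
    const int exchange = phone_number / 10000;  // leading three digits
    const int rest = phone_number % 10000;      // last four digits
    return exchange + rest;
}
```

The result can be as large as 999 + 9999, so in practice it would still have to be reduced to the size of the table, for example with the mod function.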

modular arithmetic:

use the “mod” function

• divide key by some integer, keep the remainder and discard the quotient.

ex: 2736 % 300 = 36

    2525 % 300 = 125

• advantage: good spread of keys – an index produced will fall into a desired range since, for non-negative x:

0 ≤ x % 300 ≤ 299, or more generally, 0 ≤ x % N ≤ (N-1)

where N is the size of the desired range

• disadvantage: division (needed for mod function) can be a costly operation.
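In C++ the mod function is the % operator. A minimal sketch, assuming unsigned integer keys (the name mod_hash and its parameters are illustrative only):

```cpp
#include <cassert>
#include <cstddef>

// Modular arithmetic: the remainder of key / table_size always lies in
// the range 0 .. table_size - 1, so it can be used directly as an index.
// Hypothetical helper; unsigned types keep the remainder non-negative.
std::size_t mod_hash(unsigned long key, std::size_t table_size) {
    assert(table_size > 0);
    return key % table_size;
}
```

Calling mod_hash(2736, 300) gives 36 and mod_hash(2525, 300) gives 125, as in the example; passing a prime table size such as 101 is the variation suggested earlier for a better spread.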

Hash Tables

consider 14 words:

zany zest zing zoom
zeal zeta zion zulu
zebu zeus zone
zero zinc zonk

hash function (letters are numbered from the right: c0 is the last letter of the word, c1 the one before it, c2 the second letter):

h = [ (c1-'a') * ((c0-'a') % 7) + (c1-'a') % 14 + (c2-'a') % 11 ] % 14

ex: 'zany'

h = [ ('n'-'a') * (('y'-'a') % 7) + ('n'-'a') % 14 + ('a'-'a') % 11 ] % 14
  = [ 13 * (24 % 7) + 13 % 14 + 0 % 11 ] % 14
  = [ 13 * 3 + 13 + 0 ] % 14
  = [ 39 + 13 + 0 ] % 14
  = 52 % 14
  = 10

h(“zany”) = 10
h(“zeal”) = 4
h(“zebu”) = 11
h(“zero”) = 7
h(“zest”) = 0
h(“zeta”) = 9
h(“zeus”) = 6
h(“zinc”) = 5
h(“zing”) = 1
h(“zion”) = 8
h(“zone”) = 12
h(“zonk”) = 13
h(“zoom”) = 3
h(“zulu”) = 2
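The hash function for the 14 z-words can be written in C++ as the sketch below. Two points are assumptions made to reproduce the listed values: the letters are numbered from the right (c0 is the last letter), and the % 7 is applied to c0 before the multiplication, so the parentheses are written out explicitly (C++ would otherwise evaluate a * b % 7 as (a * b) % 7). The name z_word_hash is made up.

```cpp
#include <cassert>
#include <string>

// Hash for four-letter lowercase words, with letters numbered from the right:
// c0 = last letter, c1 = third letter, c2 = second letter.
// Hypothetical helper reproducing the formula on the slide.
int z_word_hash(const std::string& word) {
    assert(word.size() == 4);
    const int c0 = word[3] - 'a';
    const int c1 = word[2] - 'a';
    const int c2 = word[1] - 'a';
    return (c1 * (c0 % 7) + c1 % 14 + c2 % 11) % 14;
}
```

Evaluated on the 14 words, this produces each index from 0 to 13 exactly once, for example 10 for “zany” and 0 for “zest”.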

minimal perfect hash

The hash function that we used is a special instance of a hash function known as a “minimal perfect hash.”

It is minimal in the sense that the indices it produced are just exactly enough to index the number of actual keys.

It is perfect in the sense that every key is transformed to a unique index value by the hash function. In general, we will not be so fortunate to be able to do this.

Collisions and Resolution:

In our last example our hash table represented an unrealistically fortunate solution.

We achieved O(1) retrieval for all keys as a result of our hash function being a “minimal perfect” hash.

In general, fate will not be quite so co-operative, and when we do find such a hash function, it will most likely become “less than perfect” as soon as we add a new key in our table.

We will now reveal the deep dark truth about the “fourteen four-letter Z-words” that were introduced earlier and the problem that this represents for us.

Hash Tables

Now, for the truth:

- there are 15 4-letter z-words!!

- the missing word is “zine”

- our hash function is no good ... it only produces indices from 0 to 13 ... we need 0 to 14 now!

- a minimal perfect hash may well exist

- let's pretend that the best we can do is:

• h(“zine”) = h(“zero”), and one index is left unused.

collision

When two or more keys collide on the same index, we call this condition a “collision”.

• When a hash function maps two records to the same location we say a “collision” has occurred.

• Obviously two records can’t occupy the same place in the table!

This is a serious problem, and we will have to develop a strategy to deal with it!

Collision resolution

• When a collision occurs, the process of dealing with the collision is known as collision resolution.

• Collision resolution:

- involves moving the colliding record to another location

– how? by re-computing the index.

- there are many strategies available for re-computing the index

- each strategy involves “probing” for an empty spot in the table
