
Appendix E-A

Hashing

Modified

Java Software Structures, 4th Edition, Lewis/Chase

Chapter Scope
• Concept of hashing
• Hashing functions
• Collision handling
  – Open addressing
  – Buckets
  – Chaining
• Deletions
• Performance

What is hashing?

• Hashing is a scheme for storing and retrieving information by (key) value. It is sometimes used to implement associative memory.
• A hash function maps a value to a location. The value (and its associated information) may be stored at that location, or at least accessed via that location.
• Very efficient for storing and retrieving: close to constant time on average when collisions are rare.
• Used extensively in computing, in both software and hardware.

Collisions
• Ideally, each value would be stored at the location it maps to in the location space. This is possible with a perfect hashing function.
• However, in most situations multiple values may map to the same location (a collision).
• So we need a strategy to handle collisions.
• There are several popular collision handling strategies.

Hash function
• A hash function is a mapping from a value space to a location space.
• The value space is any domain of values: strings, ints, phone numbers, student IDs, …
• The location space is normally a sequence of integers from 0 to N-1, where N is the size of the location space. The location space resembles a 1-dimensional array (like computer memory).

[Figure: a mapping from the value space to the location space]

Characteristics of a good hash function
• It should cover the entire location space.
• It should distribute the key values fairly evenly into the location space.
• Generally, two values that are “close together” in the value space should not be close together in the location space.

Aside: Cryptographic hashing

Division (remainder) hash function
• Probably the most commonly used method, either by itself or combined with another method.
• If the location space is of size N, divide the value (somehow represented as an integer) by N and take the remainder as the result.
• Choosing N to be a prime number improves the likelihood of the mapping distributing the values fairly evenly.
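The division method can be sketched in a few lines of Java. This is only an illustration: the class name DivisionHash and the method name hash are made up for this example, and the key is assumed to already be an integer. Math.floorMod is used instead of % so that negative keys still map into 0 to N-1.

```java
// Illustrative sketch of the division (remainder) method.
// The names DivisionHash and hash are hypothetical, not from the slides.
class DivisionHash {
    // Maps an integer key to a location in 0..n-1 by taking the remainder.
    // Math.floorMod keeps the result non-negative even for negative keys.
    static int hash(long key, int n) {
        return (int) Math.floorMod(key, (long) n);
    }
}
```

For example, with a (prime) location space of size 101, the key 2075 lands at location 55.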

Representing a value as an integer
• All information stored in computer storage is a string of bits.
• Any string of bits can be interpreted as a binary integer.
• So how do we make that interpretation?
• Modern languages try to prevent us from changing our interpretation of a string of bits (strong typing).

Char to int
Java does give us a loophole for character data:

    System.out.println((int) "ABC".charAt(0)); // displays 65

The data type char is an “integral” type, and can be automatically converted to an int or long.

    int i = 'B'; // sets i to 66

We can also use the bitwise and bit shift operators: ~, &, |, ^, <<, >>, >>>

    int i = 'B';
    System.out.println(i);                      // displays 66
    System.out.println((int) "ABC".charAt(0));  // displays 65
    System.out.println('A' & 'B');              // displays 64
    System.out.println('A' | 'B');              // displays 67
    System.out.println('A' ^ 'B');              // displays 3
    System.out.println(~'A');                   // displays -66
    System.out.println('A' << 2);               // displays 260
    System.out.println('A' >>> 1);              // displays 32
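One possible way to combine these ideas is to fold the characters of a string into a single int with a shift and an XOR. The sketch below is an assumption made for illustration (the class StringToInt, the method toInt, and the shift amount 5 are not from the slides); it is not the hash that Java's own String.hashCode uses.

```java
// Illustrative sketch: interpret a string as an integer using shift and XOR.
// The names and the shift amount (5) are hypothetical choices.
class StringToInt {
    static int toInt(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // char is an integral type, so it can be XOR-ed into an int directly
            h = (h << 5) ^ s.charAt(i);
        }
        return h;
    }
}
```

The resulting int can then be fed to a division hash to obtain a location.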

Folding
• Divide the value into parts and then combine them.
• Example: the value is 234-56-9876. Split its digits into the groups 234, 569 and 876, reverse the middle group, and add:

      234
      965
    + 876
    -----
     2075 % N
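The folding example can be reproduced with a short sketch. It assumes groups of a fixed number of decimal digits, with every second group reversed before adding, which is what the example's 965 suggests. The class and method names are hypothetical.

```java
// Illustrative sketch of folding with every second group reversed.
// FoldingHash, fold and hash are hypothetical names.
class FoldingHash {
    // Splits the digits of the key into groups of groupSize, reverses every
    // second group, and adds the groups together.
    static int fold(String key, int groupSize) {
        String digits = key.replaceAll("[^0-9]", ""); // drop dashes and other non-digits
        int sum = 0;
        int group = 0;
        for (int start = 0; start < digits.length(); start += groupSize) {
            int end = Math.min(start + groupSize, digits.length());
            String part = digits.substring(start, end);
            if (group % 2 == 1) {
                part = new StringBuilder(part).reverse().toString();
            }
            sum += Integer.parseInt(part);
            group++;
        }
        return sum;
    }

    // Reduces the folded sum to a location in 0..n-1.
    static int hash(String key, int groupSize, int n) {
        return fold(key, groupSize) % n;
    }
}
```

With the value 234-56-9876 and groups of three digits, fold returns 2075, matching the example.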

Other hash functions
• Mid square: square the value, as a number, and take a portion out of the middle of that product.
• Extraction: use only a part of an element’s value or key to compute the location at which to store the element.
• Length dependent: use a portion of the value, then combine it with the length of the value.
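As a rough illustration of the mid-square method, the sketch below squares an int key and takes a run of decimal digits from the middle of the square. Working on decimal digits (rather than bits) and the names MidSquareHash and midSquare are assumptions made for this example; count is assumed to be between 1 and 9.

```java
// Illustrative sketch of the mid-square method on decimal digits.
// MidSquareHash and midSquare are hypothetical names; count must be 1..9.
class MidSquareHash {
    // Squares the key and returns `count` digits from the middle of the square.
    static int midSquare(int key, int count) {
        String sq = Long.toString((long) key * key); // long avoids int overflow
        if (sq.length() <= count) {
            return Integer.parseInt(sq);             // square is too short to trim
        }
        int start = (sq.length() - count) / 2;
        return Integer.parseInt(sq.substring(start, start + count));
    }
}
```

For the key 1234 the square is 1522756, and the middle three digits are 227.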

Hashing Functions - Digit Analysis

• In the digit analysis method, the index is formed by extracting, and then manipulating specific digits from the key

• For example, if our key is 1234567, we might select the digits in positions 2 through 4 yielding 234

• The manipulation can then take many forms:
  – Reversing the digits (432)
  – Performing a circular shift to the right (423)
  – Performing a circular shift to the left (342)
  – Swapping each pair of digits (324)

• Alternatively, these manipulations could be done on the bits
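The digit manipulations listed above can be sketched on strings of digits. The class DigitAnalysis and its method names are hypothetical, positions are taken to be 1-based as in the 1234567 example, and the digit string is assumed to be non-empty.

```java
// Illustrative sketch of digit analysis; all names are hypothetical.
class DigitAnalysis {
    // Extracts the digits at 1-based positions from..to (inclusive).
    static String extract(long key, int from, int to) {
        return Long.toString(key).substring(from - 1, to);
    }

    static String reverse(String d) {
        return new StringBuilder(d).reverse().toString();
    }

    // Circular shift right: the last digit moves to the front.
    static String rotateRight(String d) {
        return d.charAt(d.length() - 1) + d.substring(0, d.length() - 1);
    }

    // Circular shift left: the first digit moves to the end.
    static String rotateLeft(String d) {
        return d.substring(1) + d.charAt(0);
    }

    // Swaps each adjacent pair of digits; an unpaired last digit stays put.
    static String swapPairs(String d) {
        char[] c = d.toCharArray();
        for (int i = 0; i + 1 < c.length; i += 2) {
            char t = c[i];
            c[i] = c[i + 1];
            c[i + 1] = t;
        }
        return new String(c);
    }
}
```

The manipulated digits would then typically be parsed back into an integer and reduced to a table index.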

Appendix E-B

Hashing – Open Addressing

Modified


Open Addressing (a.k.a. Closed Hashing)
• All hashed entries, including colliding ones, are stored within the hash table itself (a closed array).
• Colliding entries are stored at open (unoccupied) locations within the table.
• When a collision occurs and the entry cannot be stored at its home address (to which it was originally hashed), the table is probed for an open position where it can be stored.
• When the entry is looked up, this same probe sequence must be followed until the entry is found or it is determined that it is not in the table.

Three probing approaches
1. Linear probing
2. Quadratic probing
3. Double hashing

Open addressing using Linear Probing

In linear probing, if an entry hashes to position P and that position is occupied, we simply probe for empty positions at (P + I) % TableSize, where I = 1, 2, 3, 4, … or some other linear sequence.
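A minimal sketch of linear probing with interval 1 might look like the following. It stores int keys only, uses key % TableSize as the home position, and does not support deletion; the class LinearProbingTable and its methods are hypothetical names chosen for this example.

```java
// Illustrative sketch of open addressing with linear probing (interval 1).
// LinearProbingTable is a hypothetical class; it stores int keys only.
class LinearProbingTable {
    private final Integer[] slots; // null means the position is empty

    LinearProbingTable(int tableSize) {
        slots = new Integer[tableSize];
    }

    private int home(int key) {
        return Math.floorMod(key, slots.length);
    }

    // Returns the position where the key was stored, or -1 if the table is full.
    int insert(int key) {
        int p = home(key);
        for (int i = 0; i < slots.length; i++) {
            int pos = (p + i) % slots.length;          // (P + I) % TableSize
            if (slots[pos] == null || slots[pos] == key) {
                slots[pos] = key;
                return pos;
            }
        }
        return -1;
    }

    // Follows the same probe sequence; returns the position or -1 if absent.
    int find(int key) {
        int p = home(key);
        for (int i = 0; i < slots.length; i++) {
            int pos = (p + i) % slots.length;
            if (slots[pos] == null) return -1;         // empty cell ends the search
            if (slots[pos] == key) return pos;
        }
        return -1;
    }
}
```

With a table of size 7, the keys 10, 17 and 24 all hash to position 3, so they end up in positions 3, 4 and 5: a small cluster.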

Issues with Linear probing
• Linear probing may lead to clustering, which is both good and bad: it increases the average number of probes, but gives good locality of reference (if the interval is 1).
• Deleted entries are marked as deleted, not empty; those positions can be reused, but they do not mark the end of a probe sequence.
• If the probe interval is greater than 1, the table size should be a prime number (or at least share no common factor with the interval) to ensure all positions are in probe sequences.

Issues with Linear probing (continued)
• Performance drops off as the load factor nears 80%
• The table must then be expanded and all entries rehashed

https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
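Marking deletions can be sketched with a per-cell state. In the hypothetical TombstoneTable below (the names and the three-state encoding are assumptions for illustration), a DELETED cell can be reused by an insertion, but a search continues past it, because only a truly EMPTY cell ends a probe sequence.

```java
// Illustrative sketch of deletion markers in a linear-probing table.
// TombstoneTable and its state encoding are hypothetical.
class TombstoneTable {
    private static final int EMPTY = 0, OCCUPIED = 1, DELETED = 2;
    private final int[] keys;
    private final int[] state; // every cell starts as EMPTY (0)

    TombstoneTable(int tableSize) {
        keys = new int[tableSize];
        state = new int[tableSize];
    }

    // Follows the probe sequence; returns the position of key, or -1 if absent.
    private int locate(int key) {
        int p = Math.floorMod(key, keys.length);
        for (int i = 0; i < keys.length; i++) {
            int pos = (p + i) % keys.length;
            if (state[pos] == EMPTY) return -1;   // only an empty cell ends the search
            if (state[pos] == OCCUPIED && keys[pos] == key) return pos;
            // a DELETED cell is skipped and the search goes on
        }
        return -1;
    }

    boolean contains(int key) {
        return locate(key) >= 0;
    }

    boolean delete(int key) {
        int pos = locate(key);
        if (pos < 0) return false;
        state[pos] = DELETED;                     // mark as deleted, do not empty
        return true;
    }

    // Returns the position where the key is stored, or -1 if the table is full.
    int insert(int key) {
        int existing = locate(key);
        if (existing >= 0) return existing;
        int p = Math.floorMod(key, keys.length);
        for (int i = 0; i < keys.length; i++) {
            int pos = (p + i) % keys.length;
            if (state[pos] != OCCUPIED) {         // EMPTY and DELETED are both reusable
                keys[pos] = key;
                state[pos] = OCCUPIED;
                return pos;
            }
        }
        return -1;
    }
}
```

After inserting 10, 17 and 24 into a table of size 7 and deleting 17, the key 24 is still found, and a later insertion of 31 reuses the deleted position.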

Quadratic probing
• In quadratic probing, the probe interval is a quadratic polynomial: I² in the simplest case.
• So, if an entry hashes to position P and that position is occupied, we simply probe for empty positions at (P + I²) % TableSize, where I = 1, 2, 3, 4, …
• Less primary clustering than with linear probing
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
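The quadratic probe sequence can be computed with a one-line helper. The class QuadraticProbe and the method probe are hypothetical names for this sketch, which covers only the simplest case of the polynomial.

```java
// Illustrative sketch of the simplest quadratic probe sequence.
// QuadraticProbe and probe are hypothetical names.
class QuadraticProbe {
    // Returns the i-th position probed for home position p:
    // (p + i*i) % tableSize. The long cast avoids int overflow in i*i.
    static int probe(int p, int i, int tableSize) {
        return (int) ((p + (long) i * i) % tableSize);
    }
}
```

With P = 3 and TableSize = 7, the probes for I = 1, 2, 3, 4 are 4, 0, 5, 5. The repeated 5 shows that a quadratic sequence does not necessarily visit every position in the table.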

Double Hashing
• The interval between probes is computed by another hash function, H2(x).
• So, if an entry x hashes to position P and that position is occupied, we simply probe for empty positions at (P + I * H2(x)) % TableSize, where I = 1, 2, 3, 4, …
• Less primary clustering than with linear probing
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
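A sketch of the double hashing probe sequence follows. The slides do not say which second hash function to use; H2(x) = 1 + (x % (TableSize - 1)) is one common choice and is only an assumption here, picked because it never returns 0 (a zero interval would probe the same cell forever). The class and method names are hypothetical, and TableSize is assumed to be at least 2.

```java
// Illustrative sketch of double hashing; names and the choice of h2 are assumptions.
class DoubleHashing {
    // Home position: the ordinary division hash.
    static int h1(int key, int n) {
        return Math.floorMod(key, n);
    }

    // Assumed second hash: always in 1..n-1, so the interval is never 0.
    static int h2(int key, int n) {
        return 1 + Math.floorMod(key, n - 1);
    }

    // The i-th position probed for key: (P + I * H2(x)) % TableSize.
    static int probe(int key, int i, int n) {
        return (int) ((h1(key, n) + (long) i * h2(key, n)) % n);
    }
}
```

In a table of size 7, the keys 10 and 17 share home position 3, but their intervals differ (5 and 6), so their probe sequences separate immediately.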

Appendix E-C

Hashing – Buckets

Modified


Buckets
• The locations in the hash table are referred to as cells or as buckets.
• A bucket can be big enough to hold several entries (not just one).
• So, entries are hashed to a bucket location, and colliding entries can be stored in the same bucket until it becomes full.
• After it becomes full, the colliding elements can be stored in a common overflow area.
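The bucket scheme might be sketched as follows, assuming fixed-capacity buckets of int keys and a single shared overflow list. BucketTable and its methods are hypothetical names, and duplicates are not checked.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of hashing into fixed-size buckets with a shared
// overflow area. BucketTable is a hypothetical class.
class BucketTable {
    private final int[][] buckets;  // buckets[b][slot]
    private final int[] counts;     // number of entries currently in each bucket
    private final List<Integer> overflow = new ArrayList<>();

    BucketTable(int numBuckets, int bucketCapacity) {
        buckets = new int[numBuckets][bucketCapacity];
        counts = new int[numBuckets];
    }

    // Returns true if the key went into its home bucket,
    // false if the bucket was full and the key went to the overflow area.
    boolean insert(int key) {
        int b = Math.floorMod(key, buckets.length);
        if (counts[b] < buckets[b].length) {
            buckets[b][counts[b]++] = key;
            return true;
        }
        overflow.add(key);
        return false;
    }

    boolean contains(int key) {
        int b = Math.floorMod(key, buckets.length);
        for (int s = 0; s < counts[b]; s++) {
            if (buckets[b][s] == key) return true;
        }
        // The overflow area only needs checking when the home bucket is full.
        return counts[b] == buckets[b].length && overflow.contains(key);
    }
}
```

With 5 buckets of capacity 2, the keys 3 and 8 fill bucket 3, and 13 spills into the overflow area.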

• What is the advantage of this approach over some of the other open addressing approaches?
• Locality of reference: the likelihood that when you are accessing a place in memory or on disk, the next place you reference is “nearby”.
• This makes for better efficiency in virtual memory and more efficient disk access.
• Eliminates primary clustering
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html

Appendix E-D

Hashing – With chaining

Modified


Chaining
• The chaining method simply treats the hash table conceptually as an array of lists of individual elements.
• Thus each hash value locates a list of all entries that hash to (collide at) that hash location.
• These lists are usually linked (chained) lists.

The chaining method of collision handling

Two variants:
1. The table cells can contain the data being stored, or
2. The table cells can contain only head pointers to the lists, with all data being stored in the list nodes.

Pros and cons of each variant?

[Figure: a table with cells 0, 1, 2, …, N-1]

Basic operations
• Insert
• Find
• Delete

Lists can be ordered or not
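The three basic operations can be sketched for the head-pointer variant with unordered lists. ChainedTable is a hypothetical class storing int keys; insert adds at the head of the list without checking for duplicates, which is an assumption made here to keep insertion constant time.

```java
// Illustrative sketch of chaining with head pointers and unordered lists.
// ChainedTable is a hypothetical class; insert does not check for duplicates.
class ChainedTable {
    private static class Node {
        final int key;
        Node next;
        Node(int key, Node next) {
            this.key = key;
            this.next = next;
        }
    }

    private final Node[] heads; // heads[i] is the first node of list i, or null

    ChainedTable(int n) {
        heads = new Node[n];
    }

    private int home(int key) {
        return Math.floorMod(key, heads.length);
    }

    // Insert at the head of the list: constant time.
    void insert(int key) {
        int h = home(key);
        heads[h] = new Node(key, heads[h]);
    }

    boolean find(int key) {
        for (Node c = heads[home(key)]; c != null; c = c.next) {
            if (c.key == key) return true;
        }
        return false;
    }

    // Unlinks the first node holding key; no deletion marker is needed.
    boolean delete(int key) {
        int h = home(key);
        Node prev = null;
        for (Node c = heads[h]; c != null; prev = c, c = c.next) {
            if (c.key == key) {
                if (prev == null) heads[h] = c.next;
                else prev.next = c.next;
                return true;
            }
        }
        return false;
    }
}
```

Keeping each list ordered instead would make insertion slower but would let an unsuccessful search stop early.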

Pros of chaining – compared to closed hashing
• Hash table does not ever have to be expanded.
• Performance degrades more slowly as the table fills up.
• Fewer (or no) empty table (data) spaces.
• Insertion (at the head of the list) is simple and takes constant time.
• Deletion does not require special treatment.
• No clustering

Cons of chaining – compared to closed hashing
• Extra space used for pointers
• Extra time required to allocate list nodes dynamically
• Worse locality of reference, which is significant if lists get long

Size (and number) of data records must be considered.

Chaining using an overflow area

Chaining (with simulated links) can be accomplished using an array based structure with an overflow area.

Pros and cons??


Coalesced hashing


Omit!


Incremental resizing of a hash table


Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:

• During the resize, allocate the new hash table, but keep the old table unchanged.
• In each lookup or delete operation, check both tables.
• Perform insertion operations only in the new table.
• At each insertion, also move r elements from the old table to the new table.
• When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.
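The steps above can be sketched as follows. Everything specific in this example is an assumption: the class IncrementalTable, chaining with array lists as the underlying table, r = 2, a resize triggered when the element count reaches the number of buckets, and doubling as the growth factor (which satisfies the (r + 1)/r bound, since 2 ≥ 3/2).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of incremental resizing; all names and constants are assumptions.
class IncrementalTable {
    private static final int R = 2;              // elements moved per insertion
    private List<List<Integer>> oldTable = null; // non-null only during a resize
    private List<List<Integer>> newTable;
    private int oldCount = 0, newCount = 0;
    private int drainIndex = 0;                  // next old bucket to drain

    IncrementalTable(int capacity) {
        newTable = makeTable(capacity);
    }

    private static List<List<Integer>> makeTable(int n) {
        List<List<Integer>> t = new ArrayList<>(n);
        for (int i = 0; i < n; i++) t.add(new ArrayList<>());
        return t;
    }

    private static List<Integer> bucket(List<List<Integer>> t, int key) {
        return t.get(Math.floorMod(key, t.size()));
    }

    int size() {
        return oldCount + newCount;
    }

    // Lookups check both tables while a resize is in progress.
    boolean contains(int key) {
        return bucket(newTable, key).contains(key)
            || (oldTable != null && bucket(oldTable, key).contains(key));
    }

    // Deletions also check both tables.
    boolean delete(int key) {
        if (bucket(newTable, key).remove(Integer.valueOf(key))) { newCount--; return true; }
        if (oldTable != null && bucket(oldTable, key).remove(Integer.valueOf(key))) {
            oldCount--;
            return true;
        }
        return false;
    }

    void insert(int key) {
        if (contains(key)) return;
        if (oldTable == null && newCount >= newTable.size()) {
            oldTable = newTable;                       // keep the old table unchanged
            oldCount = newCount;
            newTable = makeTable(oldTable.size() * 2); // growth factor 2 >= (R + 1) / R
            newCount = 0;
            drainIndex = 0;
        }
        bucket(newTable, key).add(key);                // insert only into the new table
        newCount++;
        migrate(R);
    }

    // Moves up to r elements from the old table into the new one.
    private void migrate(int r) {
        while (oldTable != null && oldCount > 0 && r > 0) {
            List<Integer> b = oldTable.get(drainIndex);
            if (b.isEmpty()) { drainIndex++; continue; }
            int key = b.remove(b.size() - 1);
            bucket(newTable, key).add(key);
            oldCount--;
            newCount++;
            r--;
        }
        if (oldTable != null && oldCount == 0) oldTable = null; // "deallocate" the old table
    }
}
```

No single insertion in this sketch moves more than R existing elements, though allocating the new table itself is still done in one step.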