7
1 15-121 Introduction to Data Structures July 26, 2010 Ananda Gunawardena Hashing II Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing The Perfect hash Java Hash utilities Examples Hashing is a simple idea Hashing is Good Constant time performance O(1) in insert and find So why not always use hashing? Space requirements Cost of calculating hash keys No order – max, min is hard to calculate Finding Hash Functions A Bad hash function A Good hash function Java HashCode Each object in Java has a hash code and equals method

Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

1

15-121Introduction to Data Structures

July 26, 2010

Ananda Gunawardena

Hashing IILecture Outline

� Concept of Hashing

� Open Addressing

�Linear Probing

�Quadratic Probing

�Double Probing

� Rehashing

� The Perfect hash

� Java Hash utilities

� Examples

Hashing is a simple idea Hashing is Good

� Constant time performance O(1) in insert and find

� So why not always use hashing?

�Space requirements

�Cost of calculating hash keys

�No order – max, min is hard to calculate

Finding Hash Functions

� A Bad hash function

� A Good hash function

Java HashCode

� Each object in Java has a hash code and equals method

Page 2: Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

2

Collisions

Collisions

� Collisions cannot be avoided

�Unless we find a “perfect” hash function (later)

� Resolve collisions

�Quickly and efficiently

Open Addressing Techniques for collision handling

� Open addressing was widely used in early days for collision resolutions

�Programming languages w/o dynamic memory allocation

� We will discuss two major open addressing techniques and their analysis

�Linear Probing

�Quadratic Probing

Open Addressing

Linear Probing

� The idea:�Table remains a simple array of size N

�On insert(x), compute f(x) mod N, if the cell is full, find another by sequentially searching for the next available slot•Go to f(x)+1, f(x)+2 etc..

�On find(x), compute f(x) mod N, if the cell doesn’t match, look elsewhere.

�Linear probing function can be given by • h(x, i) = (f(x) + i) mod N (i=1,2,….)

Exercise

� Insert 58, 99, 68, 79, 59, 19, 99 to a table of size n = 11

Page 3: Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

3

Clustering problem

Clustering Problem• Clustering is a significant problem in linear probing. Why?• Illustration of primary clustering in linear probing (b) versus no clustering

(a) and the less significant secondary clustering in quadratic probing (c). Long lines represent occupied cells, and the load factor is 0.7.

Data Structures & Problem Solving using JAVA/2E Mark Allen Weiss © 2002 Addison Wesley

Avoiding clusters

� Pick a good hash table size

� Pick a good function

� Rehash when load factor gets too high

Deleting Items

Deleting Items

� How about deleting items from Hash table?

�Item in a hash table connects to others in the table(eg: BST).

�Deleting items will affect finding the others

�“Lazy Delete” – Just mark the items as inactive rather than removing it.

Lazy Delete

� Naïve removal can leave gaps!Insert f

Remove eFind f

0 a

2 b

3 c

3 e

5 d

8 j

8 u

10 g

8 s

0 a

2 b

3 c

5 d

3 f

8 j

8 u

10 g

8 s

0 a

2 b

3 c

3 e

5 d

3 f

8 j

8 u

10 g

8 s

0 a

2 b

3 c

5 d

3 f

8 j

8 u

10 g

8 s

“3 f” means search key f and hash key 3

Page 4: Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

4

Lazy Delete

� Clever removal Insert f

Remove eFind f

0 a

2 b

3 c

3 e

5 d

8 j

8 u

10 g

8 s

0 a

2 b

3 c

gone

5 d

3 f

8 j

8 u

10 g

8 s

0 a

2 b

3 c

3 e

5 d

3 f

8 j

8 u

10 g

8 s

0 a

2 b

3 c

gone

5 d

3 f

8 j

8 u

10 g

8 s

“3 f” means search key f and hash key 3

Load Factor (open addressing)

� definition: The load factor λλλλ of a probing hash table is the fraction of the tablethat is full. The load factor ranges from 0 (empty) to 1 (completely full).

� It is better to keep the load factor under 0.7

� Double the table size and rehash if load factor gets high

� Other Facts�Cost of Hash function f(x) must be minimized�When collisions occur, linear probing can always find an empty cell• But clustering can be a problem

Rehashing

Rehashing

� rehash 58, 99, 68, 79, 59, 19, 99 to a table of size n = 23

Linear Probing analysis

� Linear Probing�Idea is to find the “next” open space when collisions occur

� Analysis �Assume Table size is large�Each probe is independent of previous probe

• λλλλ is the load factor

� Theorem: If independence of probes is assumed, the average number of cells examined in an insertion using linear probing is 1/(1 – λλλλ).

� Proof: probability of empty cell is 1- λλλλ

Hence expected number of probes to find an empty cell is 1/(1- λλλλ)

Quadratic Probing

Page 5: Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

5

Quadratic probing

� Another open addressing technique

� Resolve collisions by examining certain cells (1,4,9,…) away from the original probe point

� Collision policy:

�Define h0(k), h1(k), h2(k), h3(k), …

where hi(k) = (hash(k) + i2) mod N

� Caveat:

�May not find a vacant cell!

• Table must be less than half full (λ < ½)

�(Linear probing always finds a cell.)

Keeping the Cost Down

� Finding the squares may be expensive

� Rewrite the equation

�hi(k) = (hash(k) + i2) mod N

� Multiplying by 2 is easy!

Quadratic probing Analysis

� Table size must be prime

� Load factor must be less than ½

Theorem: If the table is less than half full, and table size is prime, then a collision can always be resolved.

Rehashing

� Scaling up�What makes λ grow too large?

• Too much data

• Too many removals

� Rehash!�Do when insert fails or load factor grows�Build a new table

• Scan existing table and do inserts into new table

� Double the table size. Pick a prime table size.�What is the complexity of primality test?�What is the cost of rehashing all elements to the new table?

Questions

� What is the complexity of primality test?

� What is the cost of rehashing all elements to the new table?

Perfect Hash

� Perfect distribution of keys

� Perfect hash function always exists

� But may be hard to calculate

Page 6: Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

6

Perfect Hash

� Perfect hashing guarantees that you get no collisions at all.

� Suppose there are M keys and a table of size N (M ≤ N)�There are NM functions exists

�How can we find a function that is 1-1?

� When can we find a perfect hash?�Suppose we know all keys in advance.

�Example: hashing keywords for compilers.

Looking for the Perfect hash Function

%7 %8 %9 %10 %11 %12

4371 3 3 6 1 4

1323 0 3 0 3 3

6173 6 5 8 3 2

4199 6 7 5 9 8

4344 4 0 6 4 10

1989 1 5 0 9 9

9676 5 7 4 6 7

Here is one way to find it.

Expensive!

Hash Tables vs BST

� BST’s�BST has search O(log n) in the best case

�Bad data sets can make BST’s perform poorly

� Hashing�Hashing cannot do order operations

�Hashing is good when ordering is not important

�Insert and Find are the major operations supported

Java Tools

Java Tools - HashMap Class

� HashMap

� This implements the Map interface

� HashMap permits null values and null keys.

� Constant time performance for get and put operations

� HashMap has two parameters that affect its performance: initial capacity and load factor

� Capacity – number of buckets in the Hash Table

� Load factor – How full the Hash Table is allowed to get before capacity is automatically increased using rehash

function.

More on Java HashMaps

� HashMapHashMapHashMapHashMap(int initialCapacity)� putputputput(Object key, Object value)� getgetgetget(Object key) – returns Object� keySetkeySetkeySetkeySet() Returns a setsetsetset view of the keys contained in this map.� valuesvaluesvaluesvalues() Returns a collectioncollectioncollectioncollection view of the values contained in this map.

Page 7: Lecture Outline Hashing IIab/15-121N10/annotatedNotes/lecture16-1030.pdf · Lecture Outline Concept of Hashing Open Addressing Linear Probing Quadratic Probing Double Probing Rehashing

7

Applications

Java Hash Map

Counting Word Frequencies with Hash Maps

Finding Anagrams

What can we place in a hashtable?

� Almost anything

Conclusion

� Hash tables are widely used data structure in computer science

� Strength of hashing is insert and find is O(1)

� Weaknesses of hashing is space requirements and extracting order is hard

� Using hashing is easy with Java (HashMap class)

� In other languages (eg: C) need to implement your own hash API