15-211 Fundamental Structures of Computer Science

Jan. 30, 2003, Ananda Guna

Dictionaries, Tries and Hash Maps

Based on lectures given by Peter Lee, Avrim Blum, Danny Sleator, William Scherlis, Ananda Guna & Klaus Sutner

Dictionaries

A data structure that supports lookups and updates

Computational complexity:

– The time and space needed to authenticate the dictionary, i.e. creating and updating it.
– The time needed to perform an authenticated membership query.
– The time needed to verify the answer to an authenticated membership query.

Implementing a Dictionary

– A sorted array?
– A sorted linked list?
– A BST?
– Array of Linked Lists?
– Trie?

What is a Trie?

A trie of size (h, b) is a tree of height h and branching factor at most b.

All keys can be regarded as integers in the range [0, b^h − 1].

Each key K can be represented as an h-digit number in base b: K1 K2 K3 … Kh.

Keys are stored at the leaf level; the path from the root mirrors the decomposition of the key into digits.

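[Figure: a trie of height 2 storing the keys 20, 22, 24, 31, 32, 42, 43. The root branches on the first digit (2, 3, 4); the second level branches on the second digit (0, 2, 4 under 2; 1, 2 under 3; 2, 3 under 4).]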
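The digit decomposition described above can be sketched in a few lines. This is only an illustration: the class and method names are hypothetical, and the example assumes base 10 for readability.

```java
import java.util.Arrays;

class TrieDigits {
    // Split key into exactly h digits in base b, most significant digit first.
    // Assumes 0 <= key < b^h (an assumption of this sketch).
    static int[] digits(int key, int b, int h) {
        int[] d = new int[h];
        for (int i = h - 1; i >= 0; i--) {
            d[i] = key % b; // lowest remaining digit
            key /= b;
        }
        return d;
    }

    public static void main(String[] args) {
        // The path root -> 2 -> 4 leads to the key 24.
        System.out.println(Arrays.toString(digits(24, 10, 2))); // [2, 4]
    }
}
```

Each digit selects one branch, so a lookup follows exactly h branches.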

Motivation for tries: An Application of Trees

Sample problem: How to do instant messaging on a telephone keypad?

Typing “I love you”: 444 * 555 666 888 33 * 999 666 88 #

What is the worst-case number of keystrokes?

What might be the average number of keystrokes?

Better approach: 4* 5 6 8 3* 9 6 8#

Can be done by using tries.

Tries

A trie is a data structure that stores the information about the contents of each node in the path from the root to the node, rather than the node itself

A tree structure that encodes the possible sequences of symbols in a dictionary. Invented by Fredkin in 1960.

For example, suppose we have the alphabet {a,b}, and want to store the sequences aaa, aab, baa, bbab

Tries

aaa, aab, baa, bbab

No information stored in nodes per se.

Shape of trie determines the items.

Tries

Nodes can be quite large: N pointers to children, where N is the size of the alphabet.

Operations are very fast:

search, insert, delete are all O(k) where k is the length of the sequence in question.
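A minimal sketch of such a node layout, assuming a lowercase a-z alphabet (so N = 26). The class name ArrayTrie and the isWord flag are illustrative additions; the flag is only needed when one stored sequence can be a prefix of another.

```java
class ArrayTrie {
    private static final int ALPHABET = 26; // assumption: keys use only 'a'..'z'

    private static class Node {
        Node[] child = new Node[ALPHABET]; // N pointers, one per symbol
        boolean isWord;                    // true if a stored sequence ends here
    }

    private final Node root = new Node();

    // Follows (and creates) one node per character: O(k) for a key of length k.
    void insert(String key) {
        Node cur = root;
        for (char c : key.toCharArray()) {
            int i = c - 'a';
            if (cur.child[i] == null) cur.child[i] = new Node();
            cur = cur.child[i];
        }
        cur.isWord = true;
    }

    // Follows one pointer per character: also O(k).
    boolean contains(String key) {
        Node cur = root;
        for (char c : key.toCharArray()) {
            cur = cur.child[c - 'a'];
            if (cur == null) return false;
        }
        return cur.isWord;
    }
}
```

The running time depends on the key length, not on how many keys are stored; the price is the mostly empty child arrays.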

Keypad IM Trie

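[Figure: a trie keyed on keypad digits. Path 4 leads to "I"; path 5-4-5-3 leads to "like"; path 5-6-8-3 leads to "love"; path 9-6-8 leads to "you".]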
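One way the keypad trie might be coded is sketched below. It assumes the standard phone layout (2 = abc, ..., 9 = wxyz) and a recent Java version with generics and lambdas; the class KeypadTrie and its methods are hypothetical names, not part of the course code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeypadTrie {
    // Letters on keys 2..9 of a standard phone keypad (index 0 is key 2).
    private static final String[] KEYS =
        {"abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};

    private static class Node {
        Map<Character, Node> child = new HashMap<>();
        List<String> words = new ArrayList<>(); // words spelled by this digit path
    }

    private final Node root = new Node();

    // Map a letter to its keypad digit, e.g. 'l' -> '5'.
    static char digitFor(char letter) {
        char c = Character.toLowerCase(letter);
        for (int i = 0; i < KEYS.length; i++) {
            if (KEYS[i].indexOf(c) >= 0) return (char) ('2' + i);
        }
        throw new IllegalArgumentException("not a letter: " + letter);
    }

    // One key press per letter, e.g. "love" -> "5683".
    static String encode(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) sb.append(digitFor(c));
        return sb.toString();
    }

    void insert(String word) {
        Node cur = root;
        for (char d : encode(word).toCharArray()) {
            cur = cur.child.computeIfAbsent(d, k -> new Node());
        }
        cur.words.add(word);
    }

    // All stored words that match the given digit sequence.
    List<String> lookup(String digits) {
        Node cur = root;
        for (char d : digits.toCharArray()) {
            cur = cur.child.get(d);
            if (cur == null) return new ArrayList<>();
        }
        return cur.words;
    }
}
```

Several words can share one digit path (for example "good" and "home" both encode to 4663), which is why each node keeps a list.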

Implementing a Trie

How do we implement a trie? How about a tree of arrays?

You can also use HashMaps to implement a trie (later)

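[Figure: a tree of arrays. Each node is an array indexed by letter (A, C, D, …) whose entries point to child arrays.]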

Hashing

Why Hashing?

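Suppose we need to find a better way to maintain a table that is easy to insert into and search.

If we use a sorted array, binary search finds an item (or its insertion point) in O(log2 n) time, although the insertion itself still has to shift elements, which takes O(n) time.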

So is there an alternative way to handle operations such as insert, search, delete?

Yes, Hashing

Big Idea

Suppose we have M items we need to put into a table of size N.

Can we find a Map H such that H[ith item] ∈ [0..N-1]?

Assume that N = 5 and the values we need to insert are: cab, bea, bad etc.

Assume that we assign values to letters: a=0, b=1, c=2, etc

Big Idea Ctd..

Define H such that H[data] = (Σ characters) Mod N

H[cab] = (2+0+1) Mod 5 = 3
H[bea] = (1+4+0) Mod 5 = 0
H[bad] = (1+0+3) Mod 5 = 4

Index:  0    1    2    3    4
Item:   bea  –    –    cab  bad
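The map H above is easy to write down in code. A minimal sketch, assuming lowercase a-z input; the class name AdditiveHash is made up for illustration.

```java
class AdditiveHash {
    // H[data] = (sum of letter values) mod n, with a=0, b=1, c=2, ...
    // Assumes data contains only 'a'..'z'.
    static int hash(String data, int n) {
        int sum = 0;
        for (char c : data.toCharArray()) {
            sum += c - 'a';
        }
        return sum % n;
    }

    public static void main(String[] args) {
        System.out.println(hash("cab", 5)); // 3
        System.out.println(hash("bea", 5)); // 0
        System.out.println(hash("bad", 5)); // 4
    }
}
```

Because addition ignores order, any two words with the same letters get the same slot.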

Problems

Collisions

What if the values we need to insert are “abc”, “cba”, “bca”, etc.?
They all map to the same location based on our map H (obviously H is not a good hash map).
This is called a “collision”.
One way to deal with collisions is “separate chaining”.
The idea is to maintain an array of linked lists.
More on collisions later.
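A rough sketch of separate chaining, reusing the additive map H from the Big Idea slides. The class ChainedTable and its method names are illustrative only, and the code assumes lowercase a-z keys.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    // One linked list (chain) per table slot.
    private final List<LinkedList<String>> buckets = new ArrayList<>();

    ChainedTable(int n) {
        for (int i = 0; i < n; i++) buckets.add(new LinkedList<>());
    }

    // Additive hash: sum of letter values (a=0, b=1, ...) mod table size.
    private int h(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) sum += c - 'a';
        return sum % buckets.size();
    }

    // Colliding keys simply share the chain of their slot.
    void insert(String key) {
        LinkedList<String> chain = buckets.get(h(key));
        if (!chain.contains(key)) chain.add(key);
    }

    // Must compare against every record in the chain.
    boolean contains(String key) {
        return buckets.get(h(key)).contains(key);
    }

    int chainLength(int index) {
        return buckets.get(index).size();
    }
}
```

With N = 5, the keys "abc", "cba" and "bca" all land in the chain at index 3, so a lookup there degrades to a list scan.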

Running Time

We need to make sure that H is easy to compute (constant time).

Lookups and deletes from the hash table depend on H.

Assume M = Θ(N).

So what is a “bad” H?

Suppose we hash strings by simply adding up the letters and taking the sum mod the table size. Is that good or bad?

Homework: Think of hashing 1000 5-letter words into a table of size 1000 using the map H. What would be the key distribution like?

What is a good H?

If H behaves like a random function, each of the N-1 other keys collides with a given key with equal probability 1/M (here N is the number of keys and M is the table size).

Therefore E(collisions of a key) = (N-1)/M. If M = Θ(N) then this value is a constant (about 1 when M ≈ N). This is great.

But life is not fair. So what is a good hash function?

Let's consider hashing a set of strings Si. Say each Si is of some length i.

Consider H(Si) = (Σ_j Si[j]·d^j) Mod M, where d is some large number and M is the table size.

Is this function hard to calculate?
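One possible answer, as a sketch: the sum can be evaluated with Horner's rule, one multiply and one add per character, taking the remainder at every step so the intermediate value stays small. The class name PolyHash and the sample values d = 31 and M = 1000 are assumptions for illustration, not values from the lecture.

```java
class PolyHash {
    // H(S) = (sum over j of S[j] * d^j) mod m, evaluated by Horner's rule.
    // Works from the last character to the first so that S[j] ends up
    // multiplied by d^j. Assumes d > 0 and m > 0.
    static int hash(String s, int d, int m) {
        long h = 0; // long avoids overflow in h * d
        for (int j = s.length() - 1; j >= 0; j--) {
            h = (h * d + s.charAt(j)) % m;
        }
        return (int) h;
    }

    public static void main(String[] args) {
        // 'a' = 97, 'b' = 98: (97 + 98*31) mod 1000 = 3135 mod 1000
        System.out.println(hash("ab", 31, 1000)); // 135
        System.out.println(hash("ba", 31, 1000)); // 105
    }
}
```

Unlike the additive map, this function gives different values to "ab" and "ba", because each position carries a different weight.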

Collisions

Hash functions can be many-to-1
They can map different search keys to the same hash key.
hash('a') == 9 == hash('w')

Must compare the search key with the record found
If the match fails, there is a collision

Collision strategies

Separate chaining
Open addressing
• Linear
• Quadratic
• Double
• Etc.

The perfect hash

Linear Probing

The idea:
• Table remains a simple array
• On insert, if the cell is full, find another by sequentially searching for the next available slot
• On find, if the cell doesn’t match, look elsewhere.

Eg: Consider H(key) = key Mod 6 (assume N=6)
H(11)=5, H(10)=4, H(17)=5, H(16)=4, H(23)=5
Draw the Hash table
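A minimal sketch of insert and find under linear probing, using H(key) = key Mod N as in the example. The class LinearProbeTable is a hypothetical name, and the code assumes non-negative integer keys and no deletions.

```java
class LinearProbeTable {
    private final Integer[] cells; // null means the cell is empty

    LinearProbeTable(int n) {
        cells = new Integer[n];
    }

    // Returns the index where key ends up, or -1 if the table is full.
    int insert(int key) {
        int n = cells.length;
        int start = key % n; // H(key) = key Mod N
        for (int i = 0; i < n; i++) {
            int idx = (start + i) % n; // next slot, wrapping around
            if (cells[idx] == null || cells[idx] == key) {
                cells[idx] = key;
                return idx;
            }
        }
        return -1;
    }

    // Returns the index holding key, or -1 if it is absent.
    int find(int key) {
        int n = cells.length;
        int start = key % n;
        for (int i = 0; i < n; i++) {
            int idx = (start + i) % n;
            if (cells[idx] == null) return -1; // an empty cell ends the search
            if (cells[idx] == key) return idx;
        }
        return -1;
    }
}
```

Calling insert with 11, 10, 17, 16, 23 on a table of size 6 and noting the returned indices is one way to check a hand-drawn table.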

Linear Probe ctd..

How about deleting items?
An item in a hash table connects to others in the table (e.g., as in a BST).
“Lazy Delete” – Just mark each item as active or deleted rather than removing it.

More on Delete

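Naïve removal can leave gaps!

Table contents after each step (– marks an empty cell):

Start   Insert f   Remove e   Find f
0 a     0 a        0 a        0 a
–       –          –          –
2 b     2 b        2 b        2 b
3 c     3 c        3 c        3 c
3 e     3 e        –          –
5 d     5 d        5 d        5 d
–       3 f        3 f        3 f
–       –          –          –
8 j     8 j        8 j        8 j
8 u     8 u        8 u        8 u
10 g    10 g       10 g       10 g
8 s     8 s        8 s        8 s

Find f starts at cell 3, reaches the gap left by e, and stops without ever reaching f.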

“3 f” means search key f and hash key 3

More about delete ctd..

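Clever removal shrinks the table

Table contents after each step (– marks an empty cell, “gone” marks a lazily deleted cell):

Start   Insert f   Remove e   Find f
0 a     0 a        0 a        0 a
–       –          –          –
2 b     2 b        2 b        2 b
3 c     3 c        3 c        3 c
3 e     3 e        gone       gone
5 d     5 d        5 d        5 d
–       3 f        3 f        3 f
–       –          –          –
8 j     8 j        8 j        8 j
8 u     8 u        8 u        8 u
10 g    10 g       10 g       10 g
8 s     8 s        8 s        8 s

The “gone” marker keeps the probe sequence connected, so Find f continues past it and reaches f.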

“3 f” means search key f and hash key 3
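Lazy deletion can be sketched with a three-state marker per cell. This is an illustrative version, assuming non-negative integer keys and H(key) = key mod table size; the class LazyDeleteTable and the State enum are hypothetical names.

```java
import java.util.Arrays;

class LazyDeleteTable {
    private enum State { EMPTY, ACTIVE, DELETED }

    private final int[] keys;
    private final State[] state;

    LazyDeleteTable(int n) {
        keys = new int[n];
        state = new State[n];
        Arrays.fill(state, State.EMPTY);
    }

    private int h(int key) {
        return key % keys.length; // assumes key >= 0
    }

    boolean insert(int key) {
        if (find(key) >= 0) return true; // already present
        for (int i = 0; i < keys.length; i++) {
            int idx = (h(key) + i) % keys.length;
            if (state[idx] != State.ACTIVE) { // EMPTY and DELETED cells are reusable
                keys[idx] = key;
                state[idx] = State.ACTIVE;
                return true;
            }
        }
        return false; // table is full
    }

    int find(int key) {
        for (int i = 0; i < keys.length; i++) {
            int idx = (h(key) + i) % keys.length;
            if (state[idx] == State.EMPTY) return -1; // only a true gap ends the search
            if (state[idx] == State.ACTIVE && keys[idx] == key) return idx;
            // DELETED cells are skipped, so later keys stay reachable
        }
        return -1;
    }

    boolean remove(int key) {
        int idx = find(key);
        if (idx < 0) return false;
        state[idx] = State.DELETED; // mark as "gone"; do not empty the cell
        return true;
    }
}
```

In a table of size 12, the keys 3, 15 and 27 all hash to 3, playing the roles of c, e and f from the slides: after removing 15, a search for 27 still succeeds because it steps over the marked cell.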

Performance of linear probing

Average numbers of probes (α = load factor)
• Unsuccessful search and insert: ½ (1 + 1/(1-α)^2)
• Successful search: ½ (1 + 1/(1-α))

When α is low:
• Probe counts are close to 1

When α is high:
• E.g., when α = 0.75, unsuccessful probes: ½ (1 + 1/(1-α)^2) = ½ (1 + 16) = 8.5 probes
• E.g., when α = 0.5, unsuccessful probes: ½ (1 + 1/(1-α)^2) = ½ (1 + 4) = 2.5 probes
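The two estimates are simple to tabulate. A small sketch, with α passed in as a plain double; the class and method names are made up for illustration.

```java
class ProbeEstimates {
    // Unsuccessful search and insert: 1/2 * (1 + 1/(1 - alpha)^2)
    static double unsuccessful(double alpha) {
        double r = 1.0 / (1.0 - alpha);
        return 0.5 * (1.0 + r * r);
    }

    // Successful search: 1/2 * (1 + 1/(1 - alpha))
    static double successful(double alpha) {
        return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
    }

    public static void main(String[] args) {
        System.out.println(unsuccessful(0.75)); // 8.5
        System.out.println(unsuccessful(0.5));  // 2.5
        System.out.println(successful(0.5));    // 1.5
    }
}
```

Both formulas assume 0 ≤ α < 1; the cost grows without bound as α approaches 1.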

Quadratic probing

Resolve collisions by examining certain cells away from the original probe point

Collision policy:
Define h0(k), h1(k), h2(k), h3(k), …
where hi(k) = (hash(k) + i^2) mod size

Caveat: May not find a vacant cell!
• Table must be less than half full (α < ½)
(Linear probing always finds a cell.)

Quadratic probing

Another issue
Suppose the table size is 16.
Probe offsets that will be tried:

1 mod 16 = 1
4 mod 16 = 4
9 mod 16 = 9
16 mod 16 = 0
25 mod 16 = 9
36 mod 16 = 4
49 mod 16 = 1
64 mod 16 = 0
81 mod 16 = 1

Only four different values!

Quadratic probing

Table size must be prime
Load factor must be less than ½
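To see why the table size matters, one can count the distinct offsets i^2 mod size. The sketch below is illustrative; the class QuadraticOffsets is a hypothetical helper.

```java
import java.util.TreeSet;

class QuadraticOffsets {
    // Distinct values of i^2 mod size for i = 1..size.
    static TreeSet<Integer> distinctOffsets(int size) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (int i = 1; i <= size; i++) {
            seen.add((int) ((long) i * i % size)); // long avoids overflow in i * i
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(distinctOffsets(16)); // [0, 1, 4, 9]
        System.out.println(distinctOffsets(17).size()); // 9
    }
}
```

With size 16 only 4 of the 16 cells are reachable from a given start; with the prime size 17, 9 of the 17 cells are reachable, which is just over half. That is consistent with requiring a prime table size and a load factor below ½.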

Rehash

Scaling up
What makes α grow too large?
• Too much data
• Too many removals

Rehash!
Do when insert fails or load factor grows
Build a new table
• Scan existing table and do inserts into new table
• Twice the size or more

Adds only constant average cost
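A rough sketch of rehashing on top of linear probing. The starting size of 5, the threshold of ½ and the growth rule 2n + 1 are assumptions chosen for illustration (the new sizes are not guaranteed to be prime), and the class RehashingTable is a hypothetical name.

```java
class RehashingTable {
    private Integer[] cells = new Integer[5]; // null means empty
    private int count = 0;

    void insert(int key) {
        // Grow first if this insert would push the load factor above 1/2.
        if (2 * (count + 1) > cells.length) rehash();
        if (place(cells, key)) count++;
    }

    // Linear probing insert; returns false if key was already present.
    // Terminates because the table is never more than half full.
    private static boolean place(Integer[] table, int key) {
        int idx = key % table.length; // assumes key >= 0
        while (table[idx] != null) {
            if (table[idx] == key) return false;
            idx = (idx + 1) % table.length;
        }
        table[idx] = key;
        return true;
    }

    // Build a bigger table and re-insert every key: slots change with the size.
    private void rehash() {
        Integer[] bigger = new Integer[2 * cells.length + 1];
        for (Integer k : cells) {
            if (k != null) place(bigger, k);
        }
        cells = bigger;
    }

    boolean contains(int key) {
        int idx = key % cells.length;
        while (cells[idx] != null) {
            if (cells[idx] == key) return true;
            idx = (idx + 1) % cells.length;
        }
        return false;
    }

    int capacity() { return cells.length; }

    int size() { return count; }
}
```

Each rehash costs time proportional to the table, but because the size at least doubles, the cost averaged over all inserts stays constant.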

Double Hashing

Collision policy
Define h0(k), h1(k), h2(k), h3(k), …
where hi(k) = (hash(k) + i*hash2(k)) mod size

Caveats
• hash2(k) must never be zero
• Table size must be prime
• If hash2(k) shares a factor with the table size, the probe sequence cycles early and fewer alternative cells will be tried.

Quadratic probing may be faster/easier in practice.
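The probe sequence can be sketched as follows. The second hash used here, hash2(k) = R - (k mod R) for a prime R smaller than the table size, is one common textbook choice and is an assumption of this example, not something the slides specify; the class DoubleHashProbe is likewise hypothetical.

```java
class DoubleHashProbe {
    // First `count` cells tried for key k in a table of the given size.
    static int[] probes(int k, int size, int r, int count) {
        int h1 = k % size;    // hash(k); assumes k >= 0
        int h2 = r - (k % r); // hash2(k) lies in 1..r, so it is never zero
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) {
            seq[i] = (h1 + i * h2) % size; // hi(k) = (hash(k) + i*hash2(k)) mod size
        }
        return seq;
    }

    public static void main(String[] args) {
        // k = 23, size = 11, R = 7: hash(k) = 1, hash2(k) = 5
        int[] seq = probes(23, 11, 7, 4);
        System.out.println(java.util.Arrays.toString(seq)); // [1, 6, 0, 5]
    }
}
```

Because 11 is prime, the step 5 shares no factor with the table size and the first 11 probes visit every cell once.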

About HashMap Class

• This implements the Map interface
• HashMap permits null values and null keys.
• Constant time performance for get and put operations, assuming the hash function disperses the elements properly among the buckets
• HashMap has two parameters that affect its performance: initial capacity and load factor
• Capacity – number of buckets in the Hash Table
• Load factor – How full the Hash Table is allowed to get before capacity is automatically increased using rehash function.

More on Java HashMaps

• HashMap(int initialCapacity)
• put(Object key, Object value)
• get(Object key) – returns Object
• keySet() – Returns a set view of the keys contained in this map.
• values() – Returns a collection view of the values contained in this map.

http://java.sun.com/j2se/1.4/docs/api/java/util/HashMap.html#HashMap(int)

http://java.sun.com/j2se/1.4/docs/api/java/util/HashMap.html#put(java.lang.Object,%20java.lang.Object)

http://java.sun.com/j2se/1.4/docs/api/java/lang/Object.html

http://java.sun.com/j2se/1.4/docs/api/java/util/HashMap.html#get(java.lang.Object)

http://java.sun.com/j2se/1.4/docs/api/java/util/HashMap.html#keySet()

http://java.sun.com/j2se/1.4/docs/api/java/util/HashMap.html#values()
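A short usage sketch of the methods listed above. It is written with generics, which assumes a Java version newer than the 1.4 API linked here; the class HashMapDemo and the word-count task are just illustrations.

```java
import java.util.HashMap;
import java.util.Map;

class HashMapDemo {
    // Count how often each word occurs.
    static Map<String, Integer> wordCounts(String[] words) {
        Map<String, Integer> counts = new HashMap<>(16); // HashMap(int initialCapacity)
        for (String w : words) {
            Integer old = counts.get(w); // get returns null when the key is absent
            counts.put(w, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCounts(new String[] {"cab", "bea", "cab"});
        for (String key : counts.keySet()) { // set view of the keys
            System.out.println(key + " -> " + counts.get(key));
        }
        System.out.println(counts.values().size() + " distinct words"); // collection view of the values
    }
}
```

The iteration order of keySet() is not specified for HashMap, so the two lines in the loop may print in either order.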

Next Week

HW2 is due Monday – you need to start early

We will discuss Priority Queues and recap what we have done so far

Read More about Hash Tables in Chapter 20

See you Tuesday
