15-211 Fundamental Structures of Computer Science Jan. 30, 2003 Ananda Guna Dictionaries, Tries and Hash Maps Based on lectures given by Peter Lee, Avrim Blum, Danny Sleator, William Scherlis, Ananda Guna & Klaus Sutner

Page 1: 15-211 Fundamental Structures of Computer Science

15-211 Fundamental Structures of Computer Science

Jan. 30, 2003
Ananda Guna

Dictionaries, Tries and Hash Maps

Based on lectures given by Peter Lee, Avrim Blum, Danny Sleator, William Scherlis, Ananda Guna & Klaus Sutner

Page 2: 15-211 Fundamental Structures of Computer Science

Dictionaries

A data structure that supports lookups and updates

Computational complexity:
– The time and space needed to authenticate the dictionary, i.e. creating and updating it.
– The time needed to perform an authenticated membership query.
– The time needed to verify the answer to an authenticated membership query.

Implementing a Dictionary
– A sorted array?
– A sorted linked list?
– A BST?
– Array of Linked Lists?
– Trie?

Page 3: 15-211 Fundamental Structures of Computer Science

What is a Trie?

Trie of size (h, b) is a tree of height h and branching factor at most b

All keys can be regarded as integers in the range $[0, b^h)$

Each key K can be represented as an h-digit number in base b: $K_1 K_2 K_3 \ldots K_h$

Keys are stored in the leaf level; path from the root resembles decomposition of the keys to digits

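[Figure: trie of height 2 storing the keys 20, 22, 24, 31, 32, 42, 43. The root branches on the first digit (2, 3, 4); the next level branches on the second digit (0, 2, 4 under 2; 1, 2 under 3; 2, 3 under 4).]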
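The digit decomposition described above can be sketched as follows. This is a minimal illustration, not code from the lecture; the function name `key_digits` is made up for the example.

```python
def key_digits(key, b, h):
    """Return the h base-b digits of key, most significant digit first.

    The returned list is the path from the root to the key's leaf
    in a trie of size (h, b). The function name is hypothetical.
    """
    if not 0 <= key < b ** h:
        raise ValueError("key must lie in [0, b**h)")
    digits = []
    for _ in range(h):
        key, digit = divmod(key, b)   # peel off the least significant digit
        digits.append(digit)
    return digits[::-1]               # reverse so the first digit comes first


# The key 24 in the figure is reached by branching on 2, then on 4.
print(key_digits(24, 10, 2))
```
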

Page 4: 15-211 Fundamental Structures of Computer Science

Motivation for tries: An Application of Trees

Page 5: 15-211 Fundamental Structures of Computer Science

Sample problem: How to do instant messaging on a telephone keypad?

Page 6: 15-211 Fundamental Structures of Computer Science

Typing “I love you”: 444 * 555 666 888 33 * 999 666 88 #

What is the worst-case number of keystrokes?

What might be the average number of keystrokes?
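To experiment with the two questions above, here is a small sketch of multi-tap entry. It assumes the standard telephone keypad letter layout; the names `KEYPAD`, `PRESSES`, `multitap` and `keystrokes` are made up for the example, and the `*` and `#` separators from the slide are left out.

```python
# Standard telephone keypad layout (assumed; not spelled out on the slide).
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}

# For each letter, the run of key presses that selects it, e.g. 'l' -> "555".
PRESSES = {letter: key * (position + 1)
           for key, letters in KEYPAD.items()
           for position, letter in enumerate(letters)}


def multitap(word):
    """Spell a word as space-separated runs of key presses."""
    return " ".join(PRESSES[c] for c in word.lower())


def keystrokes(word):
    """Count the key presses needed for the letters of a word."""
    return sum(len(PRESSES[c]) for c in word.lower())


print(multitap("love"), "->", keystrokes("love"), "presses")
print("worst case per letter:", max(len(p) for p in PRESSES.values()))
```
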

Page 7: 15-211 Fundamental Structures of Computer Science

Better approach: 4* 5 6 8 3* 9 6 8#

Can be done by using tries.

Page 8: 15-211 Fundamental Structures of Computer Science

Tries

A trie is a data structure in which the contents of each node are determined by the path from the root to that node, rather than stored in the node itself

A tree structure that encodes the possible sequences of symbols in a dictionary. Invented by Fredkin in 1960.

For example, suppose we have the alphabet {a,b}, and want to store the sequences aaa, aab, baa, bbab

Page 9: 15-211 Fundamental Structures of Computer Science

Tries

aaa, aab, baa, bbab

No information stored in nodes per se.

Shape of trie determines the items.

Page 10: 15-211 Fundamental Structures of Computer Science

Tries

Nodes can be quite large: N pointers to children, where N is the size of the alphabet.

Operations are very fast:

search, insert, delete are all O(k) where k is the length of the sequence in question.
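A minimal sketch of such a trie, with one child pointer per alphabet symbol, might look like this. The class names and the `is_end` flag are assumptions added for the example (the flag marks where a stored sequence ends, which is needed when one sequence is a prefix of another).

```python
class TrieNode:
    def __init__(self, alphabet_size):
        # N child pointers, one per symbol of the alphabet.
        self.children = [None] * alphabet_size
        # Assumption for this sketch: flag marking the end of a stored sequence.
        self.is_end = False


class Trie:
    def __init__(self, alphabet="ab"):
        self.index = {symbol: i for i, symbol in enumerate(alphabet)}
        self.root = TrieNode(len(alphabet))

    def insert(self, seq):
        """Follow or create one node per symbol: O(k) for a sequence of length k."""
        node = self.root
        for symbol in seq:
            i = self.index[symbol]
            if node.children[i] is None:
                node.children[i] = TrieNode(len(self.index))
            node = node.children[i]
        node.is_end = True

    def search(self, seq):
        """Follow one pointer per symbol: O(k) for a sequence of length k."""
        node = self.root
        for symbol in seq:
            i = self.index.get(symbol)
            if i is None or node.children[i] is None:
                return False
            node = node.children[i]
        return node.is_end


trie = Trie("ab")
for word in ("aaa", "aab", "baa", "bbab"):
    trie.insert(word)
print(trie.search("aab"), trie.search("bba"))
```
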

Page 11: 15-211 Fundamental Structures of Computer Science

Keypad IM Trie

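[Figure: trie keyed on keypad digits. The path 4 leads to "I"; the paths 5-4-5-3 and 5-6-8-3 lead to "like" and "love"; the path 9-6-8 leads to "you".]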
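The keypad trie can be sketched with nested dictionaries, one level per digit pressed. This is an illustration under assumptions: the standard keypad layout, and a list of words at each node because several words can share one digit sequence. The names `KEY_OF`, `digits_for`, `build_keypad_trie` and `lookup` are hypothetical.

```python
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
KEY_OF = {letter: key for key, letters in KEYPAD.items() for letter in letters}


def digits_for(word):
    """One key press per letter, e.g. 'love' -> '5683'."""
    return "".join(KEY_OF[c] for c in word.lower())


def new_node():
    return {"children": {}, "words": []}


def build_keypad_trie(words):
    root = new_node()
    for word in words:
        node = root
        for digit in digits_for(word):
            node = node["children"].setdefault(digit, new_node())
        node["words"].append(word)   # words whose digit path ends here
    return root


def lookup(root, digits):
    """Return the words stored at the end of a digit path (empty list if none)."""
    node = root
    for digit in digits:
        node = node["children"].get(digit)
        if node is None:
            return []
    return node["words"]


root = build_keypad_trie(["I", "like", "love", "you"])
print(lookup(root, "4"), lookup(root, "5683"), lookup(root, "968"))
```
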

Page 12: 15-211 Fundamental Structures of Computer Science

Implementing a Trie

How do we implement a trie? How about a tree of arrays?

You can also use HashMaps to implement a trie (later)

[Figure: trie nodes drawn as arrays of child pointers, with entries labeled A, C, D and A.]

Page 13: 15-211 Fundamental Structures of Computer Science

Hashing

Page 14: 15-211 Fundamental Structures of Computer Science

Why Hashing?

Suppose we need to find a better way to maintain a table that is easy to insert and search.

If we use a sorted array, binary search takes $\log_2 n$ time, but an insertion must also shift elements, which can take time proportional to n.

So is there an alternative way to handle operations such as insert, search, delete?

Yes, Hashing

Page 15: 15-211 Fundamental Structures of Computer Science

Big Idea

Suppose we have M items we need to put into a table of size N.

Can we find a Map H such that H[ith item] ∈ [0..N-1]?

Assume that N = 5 and the values we need to insert are: cab, bea, bad etc.

Assume that we assign values to letters: a=0, b=1, c=2, etc

Page 16: 15-211 Fundamental Structures of Computer Science

Big Idea Ctd..

Define H such that $H[\text{data}] = \left(\sum \text{characters}\right) \bmod N$

H[cab] = (2+0+1) Mod 5 = 3
H[bea] = (1+4+0) Mod 5 = 0
H[bad] = (1+0+3) Mod 5 = 4

Resulting table:

  Index:  0     1        2        3     4
  Item:   bea   (empty)  (empty)  cab   bad
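The map H from this slide can be sketched in a few lines. The function name `additive_hash` is made up; the letter values a=0, b=1, c=2, ... follow the slide, and lowercase input is assumed.

```python
def additive_hash(word, table_size):
    """Sum the letter values (a=0, b=1, c=2, ...) and reduce mod the table size."""
    return sum(ord(c) - ord("a") for c in word) % table_size


table = [None] * 5
for word in ("cab", "bea", "bad"):
    table[additive_hash(word, 5)] = word   # no collision handling in this sketch

print(table)
```
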

Page 17: 15-211 Fundamental Structures of Computer Science

Problems

Collisions
– What if the values we need to insert are “abc”, “cba”, “bca” etc…
– They all map to the same location based on our map H (obviously H is not a good hash map)
– This is called “Collision”

One way to deal with collisions is “separate chaining”
– The idea is to maintain an array of linked lists
– More on collisions later
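A minimal sketch of separate chaining, assuming the additive map H from the earlier slides. Python lists stand in for the linked lists here, and the class and method names are hypothetical.

```python
class ChainedTable:
    def __init__(self, size):
        # One chain per table cell; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # The additive map H: a=0, b=1, c=2, ... summed mod the table size.
        return sum(ord(c) - ord("a") for c in key) % len(self.buckets)

    def insert(self, key):
        bucket = self.buckets[self._index(key)]
        if key not in bucket:          # colliding keys simply share the chain
            bucket.append(key)

    def contains(self, key):
        return key in self.buckets[self._index(key)]


table = ChainedTable(5)
for word in ("abc", "cba", "bca"):
    table.insert(word)

print(table.buckets)
```

All three words land in the same chain, so a lookup there costs time proportional to the chain length rather than constant time.
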

Page 18: 15-211 Fundamental Structures of Computer Science

Running Time

We need to make sure that H is easy to compute (constant time)

Lookup and delete times in the hash table depend on H

Assume M = Theta(N)

So what is a “bad” H?

Suppose we hash strings by simply adding up the letters and taking the sum mod the table size. Is that good or bad?

Homework: Think of hashing 1000 5-letter words into a table of size 1000 using the map H. What would the key distribution be like?
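One way to start exploring the homework question is to count how many table cells the additive map can reach at all. The sketch below assumes lowercase words with a=0 through z=25; the variable names are made up.

```python
def additive_hash(word, table_size):
    """Sum of letter values (a=0 ... z=25) mod the table size."""
    return sum(ord(c) - ord("a") for c in word) % table_size


TABLE_SIZE = 1000
WORD_LENGTH = 5

# The letter sum of a 5-letter word lies between 0 ("aaaaa") and 5 * 25 ("zzzzz").
largest_sum = WORD_LENGTH * 25
reachable = {total % TABLE_SIZE for total in range(largest_sum + 1)}

print(len(reachable), "of", TABLE_SIZE, "cells can ever be used")
```
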

Page 19: 15-211 Fundamental Structures of Computer Science

What is a good H?

If H behaves like a random function, each of the N-1 other keys collides with the given key with equal probability 1/M. (On this slide N is the number of keys and M is the table size.)

Therefore E(collisions of a key) = (N-1)/M. If M = Theta(N) then this value is a constant, about 1. This is great.

But life is not fair. So what is a good hash function?

Let's consider hashing a set of strings $S_i$. Say each $S_i$ is of some length i. Consider $H(S_i) = \left(\sum_j S_i[j] \cdot d^j\right) \bmod M$, where d is some large number and M is the table size.

Is this function hard to calculate?
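The function is not hard to calculate: Horner's rule evaluates the sum with one multiplication and one addition per character. The sketch below is an illustration, not the lecture's code; the values d = 31 and M = 1009 and the use of `ord` for character values are assumptions.

```python
def poly_hash(s, d, m):
    """Compute (sum over j of s[j] * d**j) mod m with Horner's rule."""
    h = 0
    for ch in reversed(s):            # start from the highest power of d
        h = (h * d + ord(ch)) % m     # reduce at every step to keep numbers small
    return h


def poly_hash_direct(s, d, m):
    """The same sum written out term by term, for comparison."""
    return sum(ord(ch) * d ** j for j, ch in enumerate(s)) % m


# Unlike the additive map, permutations of the same letters can now differ.
print(poly_hash("abc", 31, 1009), poly_hash("cba", 31, 1009))
```
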

Page 22: 15-211 Fundamental Structures of Computer Science

Collisions

Hash functions can be many-to-1
– They can map different search keys to the same hash key.
– hash1(`a`) == 9 == hash1(`w`)

Must compare the search key with the record found
– If the match fails, there is a collision

Page 23: 15-211 Fundamental Structures of Computer Science

Collision strategies

Separate chaining

Open addressing
– Linear
– Quadratic
– Double
– Etc.

The perfect hash

Page 24: 15-211 Fundamental Structures of Computer Science

Linear Probing

The idea:
– Table remains a simple array
– On insert, if the cell is full, find another by sequentially searching for the next available slot
– On find, if the cell doesn’t match, look elsewhere.

Eg: Consider H(key) = key Mod 6 (assume N=6)
H(11)=5, H(10)=4, H(17)=5, H(16)=4, H(23)=5
Draw the Hash table
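The example above can be checked with a short sketch of linear probing. The function names are made up, and `None` marks an empty cell.

```python
def insert_linear(table, key):
    """Place key in the first free cell at or after key mod N, wrapping around."""
    n = len(table)
    start = key % n
    for i in range(n):
        slot = (start + i) % n
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def find_linear(table, key):
    """Return the cell holding key, or -1 once an empty cell is reached."""
    n = len(table)
    start = key % n
    for i in range(n):
        slot = (start + i) % n
        if table[slot] is None:
            return -1
        if table[slot] == key:
            return slot
    return -1


table = [None] * 6
for key in (11, 10, 17, 16, 23):
    insert_linear(table, key)

print(table)
```
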

Page 25: 15-211 Fundamental Structures of Computer Science

Linear Probe ctd..

How about deleting items?
– An item in a hash table connects to others in the table (as in a BST), so it cannot simply be removed.
– “Lazy Delete” – Just mark items as active or deleted rather than removing them.

Page 26: 15-211 Fundamental Structures of Computer Science

More on Delete

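Naïve removal can leave gaps!

  Before    Insert f   Remove e   Find f
  0 a       0 a        0 a        0 a
  2 b       2 b        2 b        2 b
  3 c       3 c        3 c        3 c
  3 e       3 e        (gap)      (gap)
  5 d       5 d        5 d        5 d
  (empty)   3 f        3 f        3 f
  8 j       8 j        8 j        8 j
  8 u       8 u        8 u        8 u
  10 g      10 g       10 g       10 g
  8 s       8 s        8 s        8 s

“3 f” means search key f and hash key 3

Find f starts probing at cell 3, reaches the gap left by e, and stops, so f is reported missing even though it is in the table.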

Page 27: 15-211 Fundamental Structures of Computer Science

More about delete ctd..

Clever removal shrinks the table

  Before    Insert f   Remove e   Find f
  0 a       0 a        0 a        0 a
  2 b       2 b        2 b        2 b
  3 c       3 c        3 c        3 c
  3 e       3 e        gone       gone
  5 d       5 d        5 d        5 d
  (empty)   3 f        3 f        3 f
  8 j       8 j        8 j        8 j
  8 u       8 u        8 u        8 u
  10 g      10 g       10 g       10 g
  8 s       8 s        8 s        8 s

“3 f” means search key f and hash key 3

Find f passes over the cell marked “gone” and keeps probing, so f is still found.
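Marking a removed cell as "gone" can be sketched with a sentinel value. This is a minimal illustration: the class name, the table size of 12, the insertion order, and the `HASH_KEYS` lookup (which copies the hash keys shown on the slide) are all assumptions for the example, and keys are assumed to be inserted at most once.

```python
EMPTY = None
GONE = object()   # marker for a lazily deleted cell


class LazyDeleteTable:
    def __init__(self, size, hash_fn):
        self.slots = [EMPTY] * size
        self.hash_fn = hash_fn

    def _probe(self, key):
        start = self.hash_fn(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key):
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY or self.slots[slot] is GONE:
                self.slots[slot] = key
                return slot
        raise RuntimeError("table is full")

    def find(self, key):
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY:
                return -1                 # a truly empty cell ends the search
            if self.slots[slot] is not GONE and self.slots[slot] == key:
                return slot               # cells marked GONE are passed over
        return -1

    def remove(self, key):
        slot = self.find(key)
        if slot >= 0:
            self.slots[slot] = GONE       # mark the cell instead of emptying it
        return slot


# Hash keys as drawn on the slide ("3 f" means search key f, hash key 3).
HASH_KEYS = {"a": 0, "b": 2, "c": 3, "e": 3, "d": 5,
             "j": 8, "u": 8, "g": 10, "s": 8, "f": 3}

table = LazyDeleteTable(12, HASH_KEYS.get)
for key in "abcedjugs":
    table.insert(key)
slot_f = table.insert("f")
slot_e = table.remove("e")
print(slot_f, slot_e, table.find("f"))
```
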

Page 28: 15-211 Fundamental Structures of Computer Science

Performance of linear probing

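Average numbers of probes
– Unsuccessful search and insert: $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$
– Successful search: $\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$

When the load factor $\lambda$ is low:
– Probe counts are close to 1

When $\lambda$ is high:
– E.g., when $\lambda = 0.75$, unsuccessful probes: $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right) = \frac{1}{2}(1 + 16) = 8.5$ probes
– E.g., when $\lambda = 0.5$, unsuccessful probes: $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right) = \frac{1}{2}(1 + 4) = 2.5$ probes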
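The two formulas are easy to tabulate for other load factors. A small sketch, with made-up function names:

```python
def unsuccessful_probes(load):
    """Average probes for an unsuccessful search or an insert."""
    return 0.5 * (1 + 1 / (1 - load) ** 2)


def successful_probes(load):
    """Average probes for a successful search."""
    return 0.5 * (1 + 1 / (1 - load))


for load in (0.1, 0.5, 0.75, 0.9):
    print(load, unsuccessful_probes(load), successful_probes(load))
```
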

Page 29: 15-211 Fundamental Structures of Computer Science

Quadratic probing

Resolve collisions by examining certain cells away from the original probe point

Collision policy: Define h0(k), h1(k), h2(k), h3(k), … where $h_i(k) = (\text{hash}(k) + i^2) \bmod \text{size}$

Caveat: May not find a vacant cell!

• Table must be less than half full ($\lambda < \frac{1}{2}$)

(Linear probing always finds a cell.)

Page 33: 15-211 Fundamental Structures of Computer Science

Quadratic probing

Another issue
– Suppose the table size is 16.
– Probe offsets that will be tried:

  1 mod 16 = 1
  4 mod 16 = 4
  9 mod 16 = 9
  16 mod 16 = 0
  25 mod 16 = 9
  36 mod 16 = 4
  49 mod 16 = 1
  64 mod 16 = 0
  81 mod 16 = 1

Only four different values!
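The offsets listed above can be recomputed for any table size. The sketch below compares size 16 with the prime 17, which is chosen here only for contrast; the function name is made up.

```python
def quadratic_offsets(size):
    """Distinct values of i*i mod size for i = 1 .. size."""
    return sorted({(i * i) % size for i in range(1, size + 1)})


print(16, quadratic_offsets(16))
print(17, quadratic_offsets(17))
```

With size 16 only four offsets ever appear, while the prime size 17 reaches nine distinct offsets, just over half the table.
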

Page 34: 15-211 Fundamental Structures of Computer Science

Quadratic probing

– Table size must be prime
– Load factor must be less than ½

Page 36: 15-211 Fundamental Structures of Computer Science

Rehash

Scaling up: What makes $\lambda$ grow too large?
• Too much data
• Too many removals

Rehash!
– Do when insert fails or load factor grows
– Build a new table
  • Scan existing table and do inserts into new table
  • Twice the size or more
– Adds only constant average cost
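A rehash step can be sketched as follows, assuming integer keys, linear probing, and `None` for empty cells. The new size of 2n + 1 is just one choice that satisfies "twice the size or more"; the function names are hypothetical.

```python
def insert_linear(table, key):
    """Linear-probing insert of an integer key; None marks an empty cell."""
    n = len(table)
    for i in range(n):
        slot = (key % n + i) % n
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def rehash(table):
    """Build a larger table and re-insert every key from the old one."""
    new_table = [None] * (2 * len(table) + 1)   # assumed growth rule
    for key in table:                           # scan the existing table
        if key is not None:
            insert_linear(new_table, key)       # positions depend on the new size
    return new_table


old = [17, 16, 23, None, 10, 11]
new = rehash(old)
print(len(new), new)
```
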

Page 37: 15-211 Fundamental Structures of Computer Science

Double Hashing

Collision policy: Define h0(k), h1(k), h2(k), h3(k), … where hi(k) = (hash(k) + i*hash2(k)) mod size

Caveats
– hash2(k) must never be zero
– Table size must be prime
  • If hash2(k) and the table size share a common factor, the probe sequence cycles early and fewer alternative cells will be tried.

Quadratic probing may be faster/easier in practice.
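The effect of the table size on double hashing can be seen by listing the cells a key would probe. In this sketch the secondary hash `5 - key % 5` is an assumption chosen so that it is never zero, integer keys are hashed by `key % size`, and the function names are made up.

```python
def hash2(key):
    """Assumed secondary hash: always between 1 and 5, never zero."""
    return 5 - key % 5


def probe_sequence(key, size):
    """Cells h_i(k) = (hash(k) + i * hash2(k)) mod size for i = 0 .. size-1."""
    step = hash2(key)
    return [(key % size + i * step) % size for i in range(size)]


print(probe_sequence(10, 7))   # prime size: every cell is visited
print(probe_sequence(13, 6))   # step 2 shares a factor with size 6
```
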

Page 38: 15-211 Fundamental Structures of Computer Science

About HashMap Class

– This implements the Map interface
– HashMap permits null values and null keys.
– Constant time performance for get and put operations
– HashMap has two parameters that affect its performance: initial capacity and load factor
  • Capacity – number of buckets in the hash table
  • Load factor – how full the hash table is allowed to get before capacity is automatically increased using the rehash function.

Page 40: 15-211 Fundamental Structures of Computer Science

Next Week

HW2 is due Monday – you need to start early

We will discuss Priority Queues and recap of what we have done so far

Read More about Hash Tables in Chapter 20

See you Tuesday