15-211 Fundamental Structures of Computer Science
Jan. 30, 2003
Ananda Guna
Dictionaries, Tries and Hash Maps
Based on lectures given by Peter Lee, Avrim Blum, Danny Sleator, William Scherlis, Ananda Guna & Klaus Sutner
Dictionaries
A dictionary is a data structure that supports lookups and updates.
Computational complexity:
– The time and space needed to create and update the dictionary.
– The time needed to perform a membership query.
– The time needed to verify the answer to a membership query.
Implementing a Dictionary
– A sorted array?
– A sorted linked list?
– A BST?
– An array of linked lists?
– A trie?
What is a Trie?
A trie of size (h, b) is a tree of height h and branching factor at most b.
All keys can be regarded as integers in the range [0, b^h).
Each key K can be represented as an h-digit number in base b: K1 K2 K3 … Kh.
Keys are stored at the leaf level; the path from the root spells out the decomposition of the key into its digits.
[Figure: a trie with h = 2, b = 5 storing the keys 20, 22, 23, 24, 31, 32, 42, 43; the first-level edge selects the first digit of a key, the second-level edge the second.]
Motivation for tries: an application of trees
Sample problem: How to do instant messaging on a telephone keypad?
Typing “I love you”: 444 * 555 666 888 33 * 999 666 88 #
What is the worst-case number of keystrokes?
What might be the average number of keystrokes?
Better approach: 4* 5 6 8 3* 9 6 8#
Can be done by using tries.
Tries
A trie is a tree in which the key associated with a node is encoded by the path from the root to that node, rather than stored in the node itself.
A tree structure that encodes the possible sequences of symbols in a dictionary. Invented by Fredkin in 1960.
For example, suppose we have the alphabet {a,b}, and want to store the sequences aaa, aab, baa, bbab.
Tries
aaa, aab, baa, bbab
No information stored in nodes per se.
Shape of trie determines the items.
Tries
Nodes can be quite large: N pointers to children, where N is the size of the alphabet.
Operations are very fast:
search, insert, delete are all O(k) where k is the length of the sequence in question.
Keypad IM Trie
[Figure: keypad trie. The root has edges 4, 5, 9. Edge 4 reaches the leaf "I"; from 5, paths 4-5-3 and 6-8-3 reach "like" and "love"; from 9, path 6-8 reaches "you".]
Implementing a Trie
How do we implement a trie? How about a tree of arrays?
You can also use HashMaps to implement a trie (later)
[Figure: a trie node drawn as an array of child pointers indexed by letters (A, C, D, …); each non-null entry points to another array node.]
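The "tree of arrays" idea can be sketched in Java. This is a minimal illustration, not the course's code: it assumes a lowercase alphabet of 26 letters and marks the end of a stored key with a boolean flag.

```java
// Minimal trie node as a "tree of arrays", assuming a 26-letter
// lowercase alphabet; names are illustrative.
class Trie {
    private static final int ALPHABET = 26;
    private final Trie[] children = new Trie[ALPHABET];
    private boolean isWord;   // true if a stored key ends here

    void insert(String key) {
        Trie node = this;
        for (char c : key.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Trie();
            node = node.children[i];
        }
        node.isWord = true;
    }

    boolean contains(String key) {
        Trie node = this;
        for (char c : key.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) return false;
            node = node.children[i];
        }
        return node.isWord;   // a mere prefix does not count
    }
}
```

Both operations walk one node per character, which is the O(k) cost claimed above.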
Hashing
Why Hashing?
Suppose we need a better way to maintain a table that is easy to insert into and search.
If we use a sorted array, binary search takes O(log2 n) time, but an insertion can take O(n).
So is there an alternative way to handle operations such as insert, search, and delete?
Yes: hashing.
Big Idea
Suppose we have M items we need to put into a table of size N.
Can we find a map H such that H[ith item] ∈ [0..N-1]?
Assume that N = 5 and the values we need to insert are: cab, bea, bad etc.
Assume that we assign values to letters: a=0, b=1, c=2, etc
Big Idea Ctd..
Define H such that H[data] = (Σ of the character values) Mod N
H[cab] = (0+1+2) Mod 5 = 3 H[bea] = (1+4+0) Mod 5 = 0 H[bad] = (1+0+3) Mod 5 = 4
0: bea    1: –    2: –    3: cab    4: bad
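The additive map H from the slide is easy to write down. A sketch in Java, assuming the slide's letter values a=0, b=1, c=2, …:

```java
// Additive hash from the slide: sum the letter values (a=0, b=1, ...)
// and take the sum mod the table size n.
class AdditiveHash {
    static int hash(String s, int n) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c - 'a';   // a=0, b=1, c=2, ...
        }
        return sum % n;
    }
}
```

With n = 5 this reproduces the slide's values: hash("cab") = 3, hash("bea") = 0, hash("bad") = 4.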
Problems
Collisions
What if the values we need to insert are "abc", "cba", "bca", etc.?
They all map to the same location based on our map H (obviously H is not a good hash map).
This is called a "collision".
One way to deal with collisions is "separate chaining": the idea is to maintain an array of linked lists.
More on collisions later.
Running Time
We need to make sure that H is easy to compute (constant time).
The cost of lookups and deletes from the hash table depends on H.
Assume M = Theta(N).
So what is a "bad" H? Suppose we hash strings by simply adding up the letters and taking the sum mod the table size. Is that good or bad?
Homework: Think about hashing 1000 five-letter words into a table of size 1000 using the map H. What would the key distribution look like?
What is a good H?
If H behaves like a random function, there are N-1 other keys, each colliding with the given key with equal probability 1/M.
Therefore E(collisions with a given key) = (N-1)/M. If M = Theta(N) then this value is Theta(1). This is great.
But life is not fair. So what is a good hash function?
Let's consider hashing a set of strings Si, each of some length.
Consider H(Si) = (Σj Si[j]·d^j) Mod M, where d is some large number and M is the table size.
Is this function hard to calculate?
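It is not: Horner's rule evaluates the polynomial in one pass without computing d^j explicitly. A sketch, where the long accumulator and the parameter choices are illustrative, not from the lecture:

```java
// Polynomial string hash H(S) = (sum_j S[j] * d^j) mod m,
// evaluated by Horner's rule in a single left-to-right... actually
// right-to-left pass: h = ((S[k-1]*d + S[k-2])*d + ...)*d + S[0].
class PolyHash {
    static int hash(String s, int d, int m) {
        long h = 0;   // long avoids intermediate overflow
        for (int j = s.length() - 1; j >= 0; j--) {
            h = (h * d + s.charAt(j)) % m;
        }
        return (int) h;
    }
}
```

For example, with d = 10 and m = 1000, hash("ab") = ('a' + 'b'·10) mod 1000 = (97 + 980) mod 1000 = 77.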
Collisions
Hash functions can be many-to-1: they can map different search keys to the same hash key.
hash1(`a`) == 9 == hash1(`w`)
We must compare the search key with the record found.
If the match fails, there is a collision.
Collision strategies
Separate chaining
Open addressing:
– Linear
– Quadratic
– Double
– Etc.
The perfect hash
Linear Probing
The idea:
– The table remains a simple array.
– On insert, if the cell is full, find another by sequentially searching for the next available slot.
– On find, if the cell doesn't match, look elsewhere.
Eg: Consider H(key) = key Mod 6 (assume N = 6): H(11)=5, H(10)=4, H(17)=5, H(16)=4, H(23)=5. Draw the hash table.
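The exercise can be checked with a small sketch of linear-probing insertion (full-table handling is omitted for brevity; the class name is illustrative):

```java
// Linear probing: on a collision, scan forward (with wraparound)
// for the next empty slot.
class LinearProbe {
    final Integer[] table;   // null means empty

    LinearProbe(int size) { table = new Integer[size]; }

    void insert(int key) {
        int i = key % table.length;
        while (table[i] != null) {        // cell full: probe onward,
            i = (i + 1) % table.length;   // wrapping at the end
        }
        table[i] = key;
    }
}
```

Inserting 11, 10, 17, 16, 23 in order: 11 and 10 land in slots 5 and 4; 17 collides at 5 and wraps to 0; 16 collides at 4 and probes to 1; 23 collides at 5 and probes to 2.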
Linear Probe ctd..
How about deleting items?
An item in a hash table is connected to others in the table by its probe sequence (unlike, e.g., a BST).
"Lazy Delete": just mark each item as active or deleted rather than physically removing it.
More on Delete
Naïve removal can leave gaps!
Insert f, Remove e, Find f.
[Figure: four snapshots of a linear-probing table holding a (0), b (2), c (3), e (3), d (5), j (8), u (8), s (8), g (10). Inserting f (hash key 3) probes past c, e, and d. Naïvely removing e empties its slot, so a later find for f stops at the gap and fails.]
"3 f" means search key f and hash key 3.
More about delete ctd..
Clever removal shrinks the table.
Insert f, Remove e, Find f.
[Figure: the same snapshots, but removal marks e's slot "gone" instead of emptying it, so the find for f probes past the "gone" slot and succeeds.]
"3 f" means search key f and hash key 3.
Performance of linear probing
Average numbers of probes:
– Unsuccessful search and insert: ½ (1 + 1/(1-α)²)
– Successful search: ½ (1 + 1/(1-α))
When α is low: probe counts are close to 1.
When α is high:
– E.g., when α = 0.75, unsuccessful probes: ½ (1 + 1/(1-α)²) = ½ (1 + 16) = 8.5 probes
– E.g., when α = 0.5, unsuccessful probes: ½ (1 + 1/(1-α)²) = ½ (1 + 4) = 2.5 probes
Quadratic probing
Resolve collisions by examining certain cells away from the original probe point
Collision policy: define h0(k), h1(k), h2(k), h3(k), …
where hi(k) = (hash(k) + i²) mod size
Caveat: may not find a vacant cell!
• The table must be less than half full (α < ½). (Linear probing, by contrast, always finds a cell.)
Quadratic probing
Another issue: suppose the table size is 16. The probe offsets that will be tried are:
 1 mod 16 = 1
 4 mod 16 = 4
 9 mod 16 = 9
16 mod 16 = 0
25 mod 16 = 9
36 mod 16 = 4
49 mod 16 = 1
64 mod 16 = 0
81 mod 16 = 1
Only four different values!
Quadratic probing
Table size must be prime Load factor must be less than ½
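The collapse of offsets for a non-prime size is easy to verify programmatically. A sketch (class and method names are illustrative):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Collect the distinct quadratic-probe offsets i^2 mod tableSize.
// For size 16 only four values {0, 1, 4, 9} ever occur, so most
// cells are never probed; a prime size gives many more.
class QuadOffsets {
    static Set<Integer> offsets(int tableSize) {
        Set<Integer> seen = new LinkedHashSet<>();
        for (int i = 1; i <= tableSize; i++) {
            seen.add((i * i) % tableSize);
        }
        return seen;
    }
}
```

For size 16 this yields 4 distinct offsets; for the prime 17 it yields 9, matching the (p+1)/2 distinct quadratic residues of a prime p.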
Rehash
Scaling up: what makes α grow too large?
• Too much data
• Too many removals
Rehash! Do it when an insert fails or the load factor grows:
• Build a new table, twice the size or more.
• Scan the existing table and insert each item into the new table.
• This adds only constant average cost.
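A sketch of the rehash step for a linear-probing table of integer keys, assuming simple doubling (real tables would pick the next prime, which is omitted here for brevity):

```java
// Rehash: allocate a table twice as large and re-insert every
// non-empty slot using the new table size in the hash.
class Rehash {
    static Integer[] rehash(Integer[] old) {
        Integer[] fresh = new Integer[old.length * 2];
        for (Integer key : old) {
            if (key == null) continue;        // skip empty slots
            int i = key % fresh.length;       // hash with the NEW size
            while (fresh[i] != null) i = (i + 1) % fresh.length;
            fresh[i] = key;
        }
        return fresh;
    }
}
```

Each key is inserted once, so the scan costs O(old table size), which amortizes to constant time per insert.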
Double Hashing
Collision policy: define h0(k), h1(k), h2(k), h3(k), …
where hi(k) = (hash(k) + i·hash2(k)) mod size
Caveats:
• hash2(k) must never be zero.
• The table size must be prime: if hash2(k) shares a factor with the table size, fewer alternative cells will be tried.
Quadratic probing may be faster/easier in practice.
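The probe sequence can be sketched with the common choice hash2(k) = R - (k mod R) for a prime R smaller than the table size, which guarantees hash2 is never zero. R = 7 and size = 11 here are illustrative values, not from the lecture:

```java
// Double hashing: the i-th probe for key k is
// (hash(k) + i * hash2(k)) mod size, with hash2(k) = R - (k mod R).
class DoubleHash {
    static final int R = 7;   // prime smaller than the table size

    static int probe(int key, int i, int size) {
        int h1 = key % size;
        int h2 = R - (key % R);   // in 1..R, so never zero
        return (h1 + i * h2) % size;
    }
}
```

For key 24 in a table of size 11: hash = 2, hash2 = 4, so the probes visit cells 2, 6, 10, …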
About the HashMap Class
HashMap implements the Map interface.
HashMap permits null values and null keys.
It gives constant-time performance for get and put operations (on average).
HashMap has two parameters that affect its performance: initial capacity and load factor.
• Capacity – the number of buckets in the hash table.
• Load factor – how full the hash table is allowed to get before the capacity is automatically increased by rehashing.
More on Java HashMaps
HashMap(int initialCapacity)
put(Object key, Object value)
get(Object key) – returns an Object
keySet() – returns a set view of the keys contained in this map.
values() – returns a collection view of the values contained in this map.
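A short usage sketch of these java.util.HashMap methods (the phone-number values are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Exercise the HashMap methods listed above: the capacity
// constructor, put, get, keySet, and values.
class HashMapDemo {
    static Map<String, String> build() {
        Map<String, String> m = new HashMap<>(16);  // initial capacity
        m.put("alice", "555-1234");
        m.put("bob", "555-9876");
        return m;
    }
}
```

get returns null for an absent key, and keySet()/values() are live views backed by the map.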
Next Week
HW2 is due Monday – you need to start early
We will discuss Priority Queues and recap what we have done so far
Read More about Hash Tables in Chapter 20
See you Tuesday