Module 7: Dictionaries via Hashing
CS 240 – Data Structures and Data Management
T. Biedl, K. Lanctot, M. Sepehri, S. Wild. Based on lecture notes by many previous CS 240 instructors.
David R. Cheriton School of Computer Science, University of Waterloo
Winter 2018
References: Sedgewick 12.2, 14.1-4
version 2018-03-07 14:23
Biedl, Lanctot, Sepehri, Wild (SCS, UW) CS240 – Module 7 Winter 2018 1 / 23
Outline
1 Dictionaries via Hashing
  - Hashing Introduction
  - Separate Chaining
  - Open Addressing
  - Hash Function Strategies
Lower bound for search
The fastest implementations of the dictionary ADT require Θ(log n) time
to search a dictionary containing n items. Is this the best possible?

Theorem: In the comparison model (on the keys),
Ω(log n) comparisons are required to search a size-n dictionary.

Proof: Similar to the lower bound for sorting.
Any algorithm defines a binary decision tree with
comparisons at the nodes and actions at the leaves.
There are at least n + 1 different actions (return one of the n items, or "not found").
So there are Ω(n) leaves, and therefore the height is Ω(log n).
Direct Addressing

Requirement: For a given M ∈ N, every key k is an integer with 0 ≤ k < M.
Data structure: An array A of size M that stores (k, v) via A[k] ← v.
Example: M = 9, the dictionary stores (2, dog), (6, cat) and (8, pig).

0: –   1: –   2: dog   3: –   4: –   5: –   6: cat   7: –   8: pig
Direct Addressing Runtime
Requirement: For a given M ∈ N,
every key k is an integer with 0 ≤ k < M.

Data structure: An array A of size M that stores (k, v) via A[k] ← v.

search(k): Check whether A[k] is empty.
insert(k, v): A[k] ← v.
delete(k): A[k] ← empty.

Each operation is Θ(1).
Total storage is Θ(M).

What sorting algorithm does this remind you of? Counting Sort.
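As a concrete sketch of these operations (the class name and the use of None for "empty" are illustrative conventions, not from the slides):

```python
# Minimal direct-address table: one array slot per possible key.
class DirectAddressTable:
    def __init__(self, M):
        self.A = [None] * M          # None plays the role of "empty"

    def search(self, k):
        return self.A[k]             # None means "not found"

    def insert(self, k, v):
        self.A[k] = v

    def delete(self, k):
        self.A[k] = None

# The example above: M = 9 with (2, dog), (6, cat), (8, pig).
T = DirectAddressTable(9)
T.insert(2, "dog")
T.insert(6, "cat")
T.insert(8, "pig")
```

Each operation indexes the array exactly once, hence Θ(1) time, at the cost of Θ(M) space.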
Hashing
Direct addressing isn't possible if keys are not integers.
And the storage is very wasteful if n ≪ M.

Say keys come from some universe U.

A hash function h : U → {0, 1, . . . , M − 1} maps keys to integers.

Uniform Hashing Assumption: Each hash function value is equally likely.
This depends on the input and on how we choose the function (more on this later).

Hash table dictionary: Array T of size M (the hash table).
An item with key k is stored in T[h(k)].
Hashing example

U = integers, M = 11, h(k) = k mod 11.
The hash table stores keys 7, 13, 43, 45, 49, 92 (values are not shown):

0: –   1: 45   2: 13   3: –   4: 92   5: 49   6: –   7: 7   8: –   9: –   10: 43
Collisions
Generally the hash function h is not injective, so many keys can map to
the same integer.
  - For example, h(46) = 2 = h(13).
We get collisions: we want to insert (k, v) into the table,
but T[h(k)] is already occupied.

Two basic strategies to deal with collisions:
  - Allow multiple items at each table location (buckets)
  - Allow each item to go into multiple locations (open addressing)

We will evaluate strategies by the average cost of search, insert and
delete, in terms of n, M, and/or the load factor α = n/M.
  - The example above has load factor 6/11.

We rebuild the whole hash table and change the value of M when the
load factor gets too large or too small.
  - This is called rehashing, and costs Θ(M + n).
Separate Chaining
Each table entry is a bucket containing 0 or more KVPs.
A bucket could be implemented by any dictionary (even another hash table!).

The simplest approach is to use an unsorted linked list in each bucket.
This is called collision resolution by separate chaining.

search(k): Look for key k in the list at T[h(k)].
insert(k, v): Add (k, v) to the front of the list at T[h(k)].
delete(k): Perform a search, then delete from the linked list.
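These three operations can be sketched in Python, assuming h(k) = k mod M as in the examples (the class and method names are illustrative):

```python
# Separate chaining: each table entry is an unsorted list (the "chain").
class ChainedHashTable:
    def __init__(self, M=11):
        self.M = M
        self.T = [[] for _ in range(M)]   # M initially empty buckets

    def _h(self, k):
        return k % self.M                 # assumed hash function

    def search(self, k):
        for key, v in self.T[self._h(k)]:
            if key == k:
                return v
        return None

    def insert(self, k, v):
        self.T[self._h(k)].insert(0, (k, v))   # add to the front: O(1)

    def delete(self, k):
        bucket = self.T[self._h(k)]
        for i, (key, _) in enumerate(bucket):
            if key == k:
                del bucket[i]
                return
```

Note that this sketch allows duplicate keys if insert is called twice with the same key; a production dictionary would search first or overwrite.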
Chaining example

M = 11, h(k) = k mod 11.
The table initially holds 45, 13, 92, 49, 7 and 43 in buckets 1, 2, 4, 5, 7 and 10.

insert(41): h(41) = 8, so 41 goes into the empty bucket 8.
insert(46): h(46) = 2, so 46 is added to the front of bucket 2, before 13.
insert(16): h(16) = 5, so 16 is added to the front of bucket 5, before 49.
insert(79): h(79) = 2, so 79 is added to the front of bucket 2.

Resulting table:
0: –
1: 45
2: 79 → 46 → 13
3: –
4: 92
5: 16 → 49
6: –
7: 7
8: 41
9: –
10: 43
Complexity of chaining
Recall the load factor α = n/M.

Assuming uniform hashing, the average bucket size is exactly α.

Analysis of operations:
search: Θ(1 + α) average-case, Θ(n) worst-case
insert: O(1) worst-case, since we always insert at the front
delete: same cost as search, Θ(1 + α) average-case, Θ(n) worst-case

If we maintain M ∈ Θ(n), then the average costs are all O(1).
This is typically accomplished by rehashing whenever n < c1·M or n > c2·M,
for some constants c1, c2 with 0 < c1 < c2.
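To illustrate the rehashing policy, here is a sketch (the doubling rule and the threshold c2 = 1 are illustrative choices, not prescribed by the slides):

```python
# Chained table that rehashes (rebuilds with larger M) when alpha = n/M > 1.
class RehashingTable:
    def __init__(self, M=4):
        self.M, self.n = M, 0
        self.T = [[] for _ in range(M)]

    def insert(self, k, v):
        self.T[k % self.M].append((k, v))
        self.n += 1
        if self.n > self.M:               # load factor exceeded c2 = 1
            self._rehash(2 * self.M)      # keep M in Theta(n)

    def _rehash(self, new_M):
        items = [kv for bucket in self.T for kv in bucket]  # gather all n items
        self.M, self.T = new_M, [[] for _ in range(new_M)]
        for k, v in items:                # reinsert everything: Theta(M + n)
            self.T[k % new_M].append((k, v))
```

A rehash costs Θ(M + n), but it happens rarely enough that the amortized cost per insert stays O(1).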
Open addressing
Main idea: Each hash table entry holds only one item,
but any key k can go in multiple locations.

search and insert follow a probe sequence of possible locations for key k,
⟨h(k, 0), h(k, 1), h(k, 2), . . .⟩, until an empty spot is found.

delete becomes problematic:
We cannot simply leave an empty spot behind; the next search might
otherwise not go far enough.
  - Idea 1: Move later items in the probe sequence forward.
  - Idea 2: Lazy deletion. Mark the spot as deleted (rather than empty)
    and continue searching past deleted spots.

Simplest idea: linear probing,
h(k, i) = (h(k) + i) mod M, for some hash function h.
Linear probing example

M = 11, h(k, i) = (h(k) + i) mod 11, with h(k) = k mod 11.
The table initially holds 45, 13, 92, 49, 7 and 43 in slots 1, 2, 4, 5, 7 and 10.

insert(41): h(41, 0) = 8 is empty, so 41 goes into slot 8.
insert(84): h(84, 0) = 7 and h(84, 1) = 8 are occupied; h(84, 2) = 9 is empty,
so 84 goes into slot 9.
insert(20): h(20, 0) = 9 and h(20, 1) = 10 are occupied; h(20, 2) = 0 is empty,
so 20 goes into slot 0.
delete(43): h(43, 0) = 10 holds 43, so slot 10 is marked "deleted".
search(63): the probes h(63, 0), . . . , h(63, 6) visit slots 8, 9, 10, 0, 1, 2, 3.
Slots 8 and 9 hold 41 and 84, slot 10 is "deleted" (keep searching), slots 0, 1, 2
hold 20, 45 and 13, and slot 3 is empty, so the search stops: not found.

Resulting table:
0: 20   1: 45   2: 13   3: –   4: 92   5: 49   6: –   7: 7   8: 41   9: 84   10: deleted
Probe sequence operations
probe-sequence-insert(T, (k, v))
1.  for (j = 0; j < M; j++)
2.      if T[h(k, j)] is "empty" or "deleted"
3.          T[h(k, j)] ← (k, v)
4.          return "success"
5.  return "failure to insert"

probe-sequence-search(T, k)
1.  for (j = 0; j < M; j++)
2.      if T[h(k, j)] is "empty"
3.          return "item not found"
4.      else if T[h(k, j)] has key k
5.          return T[h(k, j)]
6.      // ignore "deleted" and keep searching
7.  return "item not found"
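These routines, plus lazy deletion, can be sketched in Python. Linear probing h(k, i) = (k + i) mod M is used to match the earlier examples; the EMPTY and DELETED sentinel objects are my own convention:

```python
# Open addressing with linear probing and lazy deletion.
EMPTY, DELETED = object(), object()   # sentinels: never equal to real items

class OpenAddressingTable:
    def __init__(self, M=11):
        self.M = M
        self.T = [EMPTY] * M

    def _probe(self, k, j):
        return (k + j) % self.M       # linear probing: h(k, i) = (k + i) mod M

    def insert(self, k, v):
        for j in range(self.M):
            i = self._probe(k, j)
            if self.T[i] is EMPTY or self.T[i] is DELETED:
                self.T[i] = (k, v)
                return True
        return False                  # failure: table is full

    def search(self, k):
        for j in range(self.M):
            i = self._probe(k, j)
            if self.T[i] is EMPTY:
                return None           # a truly empty slot ends the search
            if self.T[i] is not DELETED and self.T[i][0] == k:
                return self.T[i][1]
        return None

    def delete(self, k):
        for j in range(self.M):
            i = self._probe(k, j)
            if self.T[i] is EMPTY:
                return
            if self.T[i] is not DELETED and self.T[i][0] == k:
                self.T[i] = DELETED   # lazy deletion: keep the slot as a marker
                return
```

Note how search skips over DELETED slots but stops at EMPTY ones, which is exactly why delete cannot simply empty a slot.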
Double Hashing

Say we have two hash functions h1, h2 that are independent.

So, under uniform hashing, we assume that the probability that a key k
has h1(k) = a and h2(k) = b, for any particular a and b, is 1/M².

For double hashing, define h(k, i) = (h1(k) + i · h2(k)) mod M, where
h2(k) ≠ 0 for any k.

search, insert, delete work just like for linear probing, but with this
different probe sequence. To get valid probe sequences (ones that visit
every slot), we need gcd(h2(k), M) = 1. Easiest fix: choose M prime.
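A small sketch of the double-hashing probe sequence, using the example hash functions from these slides (M = 11 is prime, and h2 is never 0):

```python
# Double hashing: the step size h2(k) depends on the key; M prime keeps
# gcd(h2(k), M) = 1, so the probe sequence visits every slot.
M = 11

def h1(k):
    return k % M

def h2(k):
    return (k // 2) % 10 + 1        # always in {1, ..., 10}, never 0

def probe(k, i):
    return (h1(k) + i * h2(k)) % M
```

For example, key 117 probes slots 7, 5, 3, . . . , and since gcd(h2(117), 11) = 1 its first 11 probes visit all 11 slots.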
Double hashing example

M = 11, h1(k) = k mod 11, h2(k) = (⌊k/2⌋ mod 10) + 1.
The table initially holds 45, 13, 92, 49, 7 and 43 in slots 1, 2, 4, 5, 7 and 10.

insert(41): h(41, 0) = h1(41) = 8 is empty, so 41 goes into slot 8.
insert(117): h(117, 0) = h1(117) = 7 is occupied. With h2(117) = 9, the next probes
are h(117, 1) = (7 + 9) mod 11 = 5 (occupied) and h(117, 2) = (7 + 18) mod 11 = 3
(empty), so 117 goes into slot 3.

Resulting table:
0: –   1: 45   2: 13   3: 117   4: 92   5: 49   6: –   7: 7   8: 41   9: –   10: 43
Cuckoo hashing
This is a relatively new idea, from Pagh and Rodler (2001).

Again, we use two independent hash functions h1, h2.

Main idea: An item with key k can only be in T[h1(k)] or T[h2(k)].
search and delete then take constant time.
insert always puts a new item into T[h1(k)].
If T[h1(k)] was occupied: "kick out" the other item, which we then
attempt to re-insert into its alternate position.
This may lead to a loop of "kicking out". We detect this by aborting
after too many attempts.
In case of failure: rehash with a larger M and new hash functions.

insert may be slow, but is expected to take constant time if the load
factor is small enough.
Cuckoo hashing insertion
cuckoo-insert(T, x)
T: hash table, x: new item to insert
1.  y ← x, i ← h1(x.key)
2.  do at most n times:
3.      swap(y, T[i])
4.      if y is "empty" then return "success"
5.      // swap i to be the other hash-location of y
6.      if i = h1(y.key) then i ← h2(y.key)
7.      else i ← h1(y.key)
8.  return "failure"
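A direct Python port of cuckoo-insert (storing bare keys for brevity; None plays the role of "empty", and the kick-out limit is a parameter rather than exactly n):

```python
import math

M = 11
PHI = (math.sqrt(5) - 1) / 2        # golden-ratio constant from the examples

def h1(k):
    return k % M

def h2(k):
    return int(M * (PHI * k - math.floor(PHI * k)))   # multiplicative method

def cuckoo_insert(T, k, max_kicks=M):
    i = h1(k)
    for _ in range(max_kicks):
        T[i], k = k, T[i]           # swap the new key into slot i
        if k is None:
            return True             # the slot was empty: done
        # move the displaced key to its other hash location
        i = h2(k) if i == h1(k) else h1(k)
    return False                    # give up: caller should rehash
```

Running the example sequence (51, 95, 97 into a table holding 44, 26, 92) reproduces the chain of kick-outs shown on the example slides.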
Cuckoo hashing example

M = 11, h1(k) = k mod 11, h2(k) = ⌊11(ϕk − ⌊ϕk⌋)⌋.
The table initially holds 44, 26 and 92 in slots 0, 4 and 9.

insert(51): h1(51) = 7 is empty, so 51 goes into slot 7.
insert(95): h1(95) = 7 is occupied, so 95 kicks out 51, which moves to its
alternate location h2(51) = 5 (empty).
insert(97): h1(97) = 9 is occupied, so 97 kicks out 92, which moves to
h1(92) = 4, kicking out 26, which moves to h2(26) = 0, kicking out 44,
which moves to h2(44) = 2 (empty). The insert succeeds after four kick-outs.
search(26): check slots h1(26) = 4 and h2(26) = 0; found in slot 0.
delete(26): remove 26 from slot 0.

Table after the inserts:
0: 26   1: –   2: 44   3: –   4: 92   5: 51   6: –   7: 95   8: –   9: 97   10: –
Choosing a good hash function
Uniform Hashing Assumption: Each hash function value is equally likely.
Proving that this holds is usually impossible, as it requires knowledge of
the input distribution and the hash function distribution.
We can get good performance by following a few rules.
A good hash function should:
  - be very efficient to compute
  - be unrelated to any possible patterns in the data
  - depend on all parts of the key
Basic hash functions
If all keys are integers (or can be mapped to integers),
the following two approaches tend to work well:

Modular method: h(k) = k mod M.
We should choose M to be a prime.

Multiplicative method: h(k) = ⌊M(kA − ⌊kA⌋)⌋,
for some constant floating-point number A with 0 < A < 1.

Knuth suggests A = ϕ = (√5 − 1)/2 ≈ 0.618.
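Both methods as Python functions (M = 11 and Knuth's choice of A are taken from the slides; the function names are illustrative):

```python
import math

M = 11
A = (math.sqrt(5) - 1) / 2        # Knuth's suggestion, approximately 0.618

def h_mod(k):
    return k % M                   # modular method (M should be prime)

def h_mul(k):
    frac = k * A - math.floor(k * A)   # fractional part of kA
    return int(M * frac)               # multiplicative method
```

The multiplicative method scatters consecutive keys well because multiples of an irrational constant fill the unit interval evenly.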
Universal Hashing
Every hash function must fail for some sequences of inputs
(everything hashes to the same value): a terrible worst case!
Rescue: randomization!

Choose a random basic hash function, e.g.

    h(k) = ((ak + b) mod p) mod M

for a fixed prime p > M and random numbers a, b ∈ {0, . . . , p − 1}, a ≠ 0.

One can prove: for any (fixed) keys x ≠ y, the probability of a collision
using a random h is at most 1/M.

Once again: we can enforce the same expected performance for any input as
we had without randomization on average inputs.
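Drawing a random member of this family can be sketched as follows (the prime p = 131 is an arbitrary illustrative choice):

```python
import random

M = 11
p = 131                            # a fixed prime larger than M

def random_hash():
    a = random.randrange(1, p)     # a in {1, ..., p - 1}, so a != 0
    b = random.randrange(0, p)     # b in {0, ..., p - 1}
    return lambda k: ((a * k + b) % p) % M
```

Each call returns one fixed function h; the randomness lies in which h we get, not in repeated evaluations of h on the same key.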
Multi-dimensional Data
What if the keys are multi-dimensional, such as strings?

Standard approach: flatten string w to an integer f(w) ∈ N, e.g.

    A · P · P · L · E → (65, 80, 80, 76, 69) (ASCII)
                      → 65R⁴ + 80R³ + 80R² + 76R¹ + 69R⁰

(for some radix R, e.g. R = 255).

We combine this with a standard hash function h : N → {0, 1, 2, . . . , M − 1}.

With h(f(k)) as the hash values, we then use any standard hash table.

Note: computing h(f(w)) takes Ω(length of w) time.
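Flattening and hashing can be fused using Horner's rule, which avoids ever building the huge integer f(w); the radix R = 255 and modular hash with M = 11 follow the slides:

```python
M = 11
R = 255

def string_hash(w):
    h = 0
    for c in w:                    # Horner's rule: one pass over the string
        h = (h * R + ord(c)) % M   # reduce mod M at every step
    return h
```

Because (x mod M) behaves well under addition and multiplication, reducing at every step gives exactly f(w) mod M, in Θ(|w|) time and O(1) extra space.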
Hashing vs. Balanced Search Trees
Advantages of balanced search trees:
  - O(log n) worst-case operation cost
  - Does not require any assumptions, special functions,
    or known properties of the input distribution
  - Predictable space usage (exactly n nodes)
  - Never need to rebuild the entire structure
  - Supports ordered dictionary operations (rank, select, etc.)

Advantages of hash tables:
  - O(1) operations (if hashes are well-spread and the load factor is small)
  - We can choose a space-time tradeoff via the load factor
  - Cuckoo hashing achieves O(1) worst-case for search and delete