Hash Tables
Universal Families of Hash Functions
Bloom Filters
Wednesday, July 23rd
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Hash Tables: A Randomized Data Structure

Implements the "Dictionary" abstract data type (ADT):
Insert
Delete
Lookup
We'll assume no duplicates.

Applications:
Symbol tables/compilers: which variables have already been declared?
ISPs: is an IP address spam/blacklisted?
many others
Setup

Universe U of all possible elements
e.g., all possible 2^32 IP addresses
e.g., all possible variables that can be declared
Maintain a possibly evolving subset S ⊆ U
|S| = m and |U| >> m
S might be evolving over time.
Naïve Dictionary Implementations (1): Bit Vectors

An array A, keeping one bit (0/1) for each element of U
Insert element i => A[i] = 1
Delete element i => A[i] = 0
Lookup element i => return A[i] == 1
Time complexity of every operation: O(1)
Space: O(|U|) (e.g., 2^32 bits for IP addresses)
Quick but Not Scalable!
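The three operations above can be sketched in a few lines. This is a minimal illustration (using a small 2^8 universe to keep the example tiny; the class name and sizes are ours, not the lecture's):

```python
class BitVectorDict:
    """Dictionary over a fixed universe {0, ..., universe_size - 1}.

    Every operation is O(1), but space is O(|U|)."""

    def __init__(self, universe_size):
        # One entry per element of U (a byte per element here, for simplicity).
        self.bits = bytearray(universe_size)

    def insert(self, i):
        self.bits[i] = 1

    def delete(self, i):
        self.bits[i] = 0

    def lookup(self, i):
        return self.bits[i] == 1

d = BitVectorDict(2 ** 8)
d.insert(42)
print(d.lookup(42))  # True
d.delete(42)
print(d.lookup(42))  # False
```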
Naïve Dictionary Implementations (2): Linked List

One entry for each element in S
Insert element i => check if i exists; if not, append to the list
Delete element i => find i in the list and remove it
Lookup element i => go through the entire list
Time complexity of every operation: O(|S|)
Space: O(|S|)
Scalable but Not Quick!
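A sketch of the list option, with a Python list standing in for the linked chain (the class name is ours); every operation scans, so all three are O(|S|):

```python
class ListDict:
    """Dictionary as an unsorted list: O(|S|) space, O(|S|) per operation."""

    def __init__(self):
        self.items = []  # the elements of S, in insertion order

    def insert(self, i):
        if i not in self.items:   # O(|S|) duplicate check
            self.items.append(i)

    def delete(self, i):
        if i in self.items:
            self.items.remove(i)  # O(|S|) scan

    def lookup(self, i):
        return i in self.items    # O(|S|) scan
```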
Hash Tables: Best of Both Worlds

A randomized dictionary that is:
Quick: O(1) expected time for each operation
Scalable: O(|S|) space
Hash Tables: High-level Idea

Buckets: distinct locations in the hash table
Let n be the # of buckets
n ≈ m (recall m = |S|)
i.e., load factor m/n = O(1)
Hash function h: U -> {0, 1, …, n-1}
We store each element x in bucket h(x)
(Notation throughout: U = universe, m = |S|, n = # buckets)
[Figure: h maps elements of the universe U into buckets 0, 1, …, n-1.]
Collisions

A collision: multiple elements hash to the same bucket.
Suppose we are about to insert a new element x and bucket h(x) is already occupied.
Resolving collisions:
Chaining: keep a linked list per bucket; append x to the list at h(x)
Open addressing: if h(x) is occupied, deterministically probe for another empty bucket; saves space (no pointers)
Chaining

[Figure: a sequence of insertions into a chained hash table. e3 hashes to bucket 1; e7 to bucket n-2; e5 to bucket n-1; then e1 and e4 also hash to bucket 1, so bucket 1's chain grows to e3 -> e1 -> e4. Empty buckets hold Null.]
Operations (With Chaining)

Insert(x): go to bucket h(x); if x is not in the list, append it.
Delete(x): go to bucket h(x); if x is in the list, delete it.
Lookup(x): go to bucket h(x); return true iff x is in the list.
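The three chained operations can be sketched as follows. This is an illustrative toy, not the lecture's construction: the hash function is Python's built-in `hash` reduced mod n, a stand-in for the universal families discussed later.

```python
class ChainedHashTable:
    """Hash table with chaining: one list ("chain") per bucket."""

    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _h(self, x):
        # Stand-in hash function; the lecture's point is that this
        # choice determines the chain lengths.
        return hash(x) % self.n

    def insert(self, x):
        chain = self.buckets[self._h(x)]
        if x not in chain:        # O(chain length), no duplicates
            chain.append(x)

    def delete(self, x):
        chain = self.buckets[self._h(x)]
        if x in chain:
            chain.remove(x)       # O(chain length)

    def lookup(self, x):
        return x in self.buckets[self._h(x)]  # O(chain length)
```

All three operations cost O(length of the chain at h(x)), which is exactly the quantity the next slides analyze.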
Running Time of Operations

Assume evaluating the hash function takes constant time.
(This may not be true for all hash functions.)
Consider an element x:
Insert: O(length of the linked list at bucket h(x))
Delete: O(length of the linked list at bucket h(x))
Lookup: O(length of the linked list at bucket h(x))
Worst & Best Scenarios

Let m be the # of elements in the hash table.
Worst case: O(m) per operation (everything in one bucket)
Best case: O(1) per operation
The linked-list lengths depend on the quality of the hash function!
Fundamental question: how can we choose "good" hash functions?
Bad Hash Functions

Recall our IP-address example: 32-bit addresses.
# buckets n = 2^8
Idea: use the most significant 8 bits.
Problem: strong correlations with the geography of how IP addresses are assigned; e.g., 171 or 172 as the first 8 bits is common.
Lots of addresses would get mapped to the same bucket.
In practice, be very careful when picking hash functions!
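The flawed scheme above is easy to sketch: hashing an IPv4 address by its top 8 bits sends every address sharing a first octet to the same bucket (the specific addresses below are made-up examples):

```python
def bad_hash(ip):
    """Map a dotted IPv4 string to one of 2^8 buckets using only the
    most significant 8 bits -- the flawed scheme from the slide."""
    octets = [int(p) for p in ip.split(".")]
    value = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
    return value >> 24  # the top 8 bits are just the first octet

# Every address beginning with 172 lands in the same bucket:
print(bad_hash("172.16.0.1"))    # 172
print(bad_hash("172.217.4.46"))  # 172
```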
Is There A Single Good Hash Function?

Idea: design a clever hash function h that spreads every data set evenly across the buckets.
Problem: such a function cannot exist!
Recall |U| >> m ≈ n. By the pigeonhole principle, for any fixed h there is a bucket i such that at least |U|/n elements of U map to i.
If S is drawn entirely from those elements, every operation costs O(m)!
No Single Good Hash Function!

Claim: for every single hash function h, there is a pathological data set!
Proof: by the pigeonhole principle.
Solution: Pick a Hash Function Randomly

Design a set, or "family," H of hash functions such that for every data set S, if we pick an h ∈ H at random, then almost always we spread S out evenly across the buckets.
Question: why couldn't you put the randomness inside a single hash function instead?
Clarification on Proposed Analysis

We pick h randomly from H and analyze the expected performance on an arbitrary but fixed input S.
[Figure: the same fixed input S, hashed with independently drawn functions h1, h2, …, ht, yields performances 1 through t; the expectation is over these random draws of h, not over inputs.]
Roadmap

1. Define what it means for H to be "universal."
2. Show that if H is universal and we pick h ∈ H randomly, then our hash table has O(1) expected cost per operation.
3. Show that simple and practical universal families H exist.
1. Universal Family of Hash Functions

Let H be a set of functions from U to {0, 1, …, n-1}.
Definition: H is universal if for all x, y ∈ U with x ≠ y, when h is chosen uniformly at random from H:
Pr(h(x) = h(y)) ≤ 1/n
I.e., the fraction of hash functions in H that make x and y collide is at most 1/n.
Why 1/n? "As if we were mapping x and y to buckets independently and uniformly at random."
2. Universality => Operations Are O(1)

Let H be a universal family of hash functions from U to {0, 1, …, n-1}, and recall m = O(n).
Claim: if h is picked randomly from H, then for any data set S, hash table operations run in O(1) expected time.
2. Universality => Operations Are O(1)

Proof: pick h randomly from H and insert the elements of S. A new element x arrives; say we want to perform Lookup(x).
Cost: O(# elements in bucket h(x)). This quantity is a random variable; call it Z.

Proof continued: Z = # elements in bucket h(x).
For each element y ∈ S with y ≠ x, let Xy be 1 if h(y) = h(x), and 0 otherwise. Then
Z ≤ 1 + Σ_{y ∈ S, y ≠ x} Xy
(the leading 1 covers the case that x itself is already in the table).
By linearity of expectation and universality:
E[Z] ≤ 1 + Σ_{y ∈ S, y ≠ x} Pr(h(y) = h(x)) ≤ 1 + m/n = O(1).
Q.E.D.
3. Universal Families of HF Exist (1)

Let n = 2^b and |U| = 2^t, with t > b.
Represent each x ∈ U as a t-bit binary vector.
Example: |U| = 2^7 = 128, and the hash table has size 2^4 = 16.
Pick a random 0/1 matrix M with b rows and t columns, and define h(x) = Mx, where the matrix-vector multiplication is done mod 2.
[Figure: a random 4 x 7 0/1 matrix M times the 7-bit vector for element 52 equals the 4-bit vector for bucket 12, mod 2.]
3. Universal Families of HF Exist (2)

h(x) = Mx maps {0,1}^t to {0,1}^b, i.e., U -> {0, 1, …, n-1}.
H = the set of all possible b x t 0/1 matrices M.
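The random-matrix family can be sketched directly (the parameters b = 4, t = 7 below match the slide's example; the function names are ours):

```python
import random

def random_matrix(b, t, rng):
    """Draw h from H: a b x t matrix of independent fair coin flips."""
    return [[rng.randint(0, 1) for _ in range(t)] for _ in range(b)]

def matrix_hash(M, x, t):
    """h(x) = Mx mod 2, reading x as a t-bit vector (most significant bit first)."""
    bits = [(x >> (t - 1 - j)) & 1 for j in range(t)]
    out = 0
    for row in M:
        dot = sum(r * z for r, z in zip(row, bits)) % 2  # one output bit
        out = (out << 1) | dot
    return out  # a bucket index in {0, ..., 2^b - 1}

rng = random.Random(0)
M = random_matrix(b=4, t=7, rng=rng)
print(matrix_hash(M, 52, t=7))  # some bucket in {0, ..., 15}
```

Note the map is linear over GF(2): h(x XOR y) = h(x) XOR h(y), which is exactly why the collision analysis below can reduce Mx = My to Mz = 0 for z = x XOR y.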
Proof that H is Universal (1)

We need to prove that for all x ≠ y, Pr(h(x) = h(y)) ≤ 1/n = 1/2^b when M is picked uniformly at random from H — equivalently, when each cell of M is an independent fair coin flip.
Proof that H is Universal (2)

x and y differ in at least one bit; w.l.o.g. say they differ in the last bit.
Let z = x - y (mod 2), so z's last bit is 1. Note that Mx = My iff Mz = 0.
Q: Pr(Mz = 0)?
Proof that H is Universal (3)

Pr(Mz = 0) = Pr(Mz[1] = 0 & Mz[2] = 0 & … & Mz[b] = 0)
The events Mz[i] = 0 and Mz[j] = 0 are independent for i ≠ j, since the coin flips for row i of M are independent of the coin flips for row j.
So Pr(Mz = 0) = Pr(Mz[1] = 0) · Pr(Mz[2] = 0) · … · Pr(Mz[b] = 0)
Q: Pr(Mz[i] = 0)?
Proof that H is Universal (4)

Mz[i] = m_i1·z1 + m_i2·z2 + … + m_it·1 (mod 2), since z's last bit is 1.
Let s be the (mod 2) sum of the first t-1 terms. Then Mz[i] = s + m_it (mod 2), so Mz[i] = 0 iff m_it = s.
Proof that H is Universal (5)

Irrespective of the first t-1 coin flips in row i, everything depends on the last coin flip: m_it equals s with probability exactly 1/2.
Hence Pr(Mz[i] = 0) = 1/2, and
Pr(Mx = My) = Pr(Mz = 0) = (1/2)^b = 1/n.
Q.E.D.
Storing and Evaluating Hash Function h (M)

Q: How much space do we need to store the random matrix M?
A: bt bits = O(log n · log|U|)
Q: How much time to evaluate Mz?
A: Naïvely, bt = O(log n · log|U|) bit operations.
Summary: H is a relatively fast and practical universal family of hash functions.
Another Possible Family

We're hashing from U to {0, 1, …, n-1}.
Let H be the set of all such functions.
Question: is H universal?
Another Possible Family

Fix x ≠ y and a bucket j. We're hashing from U to {0, 1, …, n-1}.
Q1: # of such functions? A1: n^|U|
Q2: # of functions in which h(x) = h(y) = j? A2: n^(|U|-2)
Q3: # of functions in which h(x) = h(y)? A3: n · n^(|U|-2) = n^(|U|-1)
Q4: Pr(h(x) = h(y))? A4: n^(|U|-1) / n^|U| = 1/n => H is universal!
Why is H Impractical?

There are n^|U| functions in H.
What's the cost of storing a function h from H?
log|H| = O(|U| log n) bits.
Not practical!
Summary

1. Hash tables
2. Defined universal families of hash functions
3. Universal family => hash table ops run in O(1) expected time
4. Universal families exist
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Bloom Filters: A Randomized Data Structure

Implements a limited version of the Dictionary ADT: Insert and Lookup only.
Compared to hash tables:
Applications: website caches for ISPs, among others
Cons:
No deletes
Not always a correct output to Lookup(x) => false positives
Pros:
More space efficient (stores no pointers to the actual objects inserted)
Same Setup As Hash Tables

Universe U of all possible elements (e.g., all possible 2^32 IP addresses)
Maintain a subset S ⊆ U, with |S| = m and |U| >> m
Bloom Filters

A Bloom filter consists of:
A bit array of size n, initially all 0 (individual bits, not buckets)
k hash functions h1, …, hk
Space cost per element: n/m bits
[Figure: a length-16 bit array, initially all zeros.]
Insertions

Insert(a): set bits h1(a), …, hk(a) to 1 => O(k) time.
Example with k = 3, starting from the all-zero array:
Insert x, with h1(x)=2, h2(x)=9, h3(x)=0: bits 0, 2, 9 become 1.
Insert y, with h1(y)=1, h2(y)=5, h3(y)=9: bits 1 and 5 also become 1.
Insert z, with h1(z)=10, h2(z)=11, h3(z)=5: bits 10 and 11 also become 1.
The array is now 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0.
Do you see why there would be false positives?
Lookup

Lookup(a): return true iff bits h1(a), …, hk(a) are all 1 => O(k) time.
With the array 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 (bit positions 0 through 15):
x: h1(x)=2, h2(x)=9, h3(x)=0 => Lookup(x) = true
z: h1(z)=3, h2(z)=9, h3(z)=4 => Lookup(z) = false (bits 3 and 4 are 0)
t: h1(t)=0, h2(t)=1, h3(t)=2 => Lookup(t) = true, even though t was never inserted: a false positive!
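Insert and Lookup together can be sketched as below. The k hash functions are derived from salted SHA-256 digests, an illustrative stand-in (the lecture does not specify how the hi are built):

```python
import hashlib

class BloomFilter:
    """Sketch of a Bloom filter with an n-bit array and k hash functions."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _hashes(self, item):
        # Derive k positions in {0, ..., n-1} from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, item):
        for j in self._hashes(item):  # O(k): set k bits
            self.bits[j] = 1

    def lookup(self, item):
        # O(k): true iff all k bits are set. No false negatives;
        # false positives possible when other insertions set these bits.
        return all(self.bits[j] == 1 for j in self._hashes(item))

bf = BloomFilter(n=128, k=3)
bf.insert("10.0.0.1")
print(bf.lookup("10.0.0.1"))  # True -- inserted elements are never missed
```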
Can Bloom Filters Be Useful?

Can Bloom filters be both space efficient and have a low false-positive rate?
What is the probability of a false positive, as a function of n, m, and k?
Probability of False Positive

We have inserted m elements into the Bloom filter.
A new element z arrives that has not been inserted before.
Q: What is Pr(false positive for z)?
Say h1(z) = j1, …, hk(z) = jk.
Simplifying (unjustified) assumption: all hashing is totally random!
For every hi and every x, hi(x) is uniform over {0, 1, …, n-1} and independent of all other hash values hj(y).
Warning: this is only to simplify the analysis; it won't hold in practice.
Pr(bit j is 1 after m insertions)?

Consider a particular bit j in the array.
Q1: Fix hi and an element x. Pr(hi(x) turns bit j to 1)? A1: 1/n
Q2: Pr(x turns bit j to 1), i.e., Pr(at least one of h1(x), …, hk(x) = j)?
A2: 1 - Pr(x does not turn j to 1) = 1 - (1 - 1/n)^k
Q3: Pr(bit j = 1 after m insertions)?
A3: 1 - Pr(no element turns j to 1) = 1 - (1 - 1/n)^(km)
Pr(false positive for x)?

Recall that for x we check k bits: h1(x) = j1, …, hk(x) = jk.
Pr(bit ji = 1) = 1 - (1 - 1/n)^(km)
Pr(false positive) = Pr(all ji = 1) = (1 - (1 - 1/n)^(km))^k
Recall the calculus fact: 1 + x ≤ e^x; moreover, around x = 0, 1 + x ≈ e^x.
Therefore: Pr(false positive) ≈ (1 - e^(-km/n))^k
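The exact expression and its exponential approximation can be compared numerically; this sketch uses illustrative parameters of our choosing (m = 1000 elements, n = 8000 bits, k = 5 hash functions):

```python
import math

def fp_exact(n, m, k):
    """False-positive estimate from the slide: (1 - (1 - 1/n)^(km))^k."""
    return (1 - (1 - 1 / n) ** (k * m)) ** k

def fp_approx(n, m, k):
    """Exponential approximation: (1 - e^(-km/n))^k."""
    return (1 - math.exp(-k * m / n)) ** k

# With 8 bits per element and k = 5, both forms give roughly a 2% rate.
print(round(fp_exact(8000, 1000, 5), 4))
print(round(fp_approx(8000, 1000, 5), 4))
```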
How Does Failure Rate Change With k, n?

Observation 1: as n increases, the failure rate decreases.
Observation 2: as k (the # of hash functions) increases:
more bits to check => less likely to fail
more bits set per object => more likely to fail
so it is unclear whether the rate increases or decreases.
Question: what's the optimal k for fixed n/m?
Answer (by taking derivatives): k = ln(2)·(n/m) ≈ 0.69·(n/m)
At this k, e^(-km/n) = 1/2, so the failure rate is (1/2)^k.
How Does Failure Rate Change With k, n?

For fixed n/m, with optimal k = ln(2)·(n/m), the failure rate is
(1/2)^(ln(2)·n/m) ≈ (0.6185)^(n/m).
Already at n = 8m, the rate is about 1-2%.
It decreases exponentially with n/m.
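The optimal-k formula and the resulting rate are easy to tabulate (a small sketch; the n/m ratios chosen are just examples):

```python
import math

def optimal_k(bits_per_element):
    """Optimal # of hash functions for a given n/m: k = ln(2) * (n/m)."""
    return math.log(2) * bits_per_element

def failure_rate(bits_per_element):
    """False-positive rate at the optimal k: (1/2)^k = (0.6185...)^(n/m)."""
    return 0.5 ** optimal_k(bits_per_element)

# n/m = 8 gives k about 5.5 and a rate of roughly 2%, matching the slide;
# doubling n/m to 16 squares the rate.
for ratio in (4, 8, 16):
    print(ratio, round(optimal_k(ratio), 2), round(failure_rate(ratio), 4))
```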
Next Week: Dynamic Programming