Hash Tables
Universal Families of Hash Functions
Bloom Filters
Wednesday, July 23rd
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Hash Tables: A Randomized Data Structure

Implements the "Dictionary" abstract data type (ADT):
Insert
Delete
Lookup
We'll assume no duplicates.

Applications:
Symbol tables/compilers: which variables have already been declared?
ISPs: is an IP address spam/blacklisted?
many others
Setup

Universe U of all possible elements
e.g., all possible 2^32 IP addresses
e.g., all possible variables that can be declared
Maintain a possibly evolving subset S ⊆ U
|S| = m and |U| >> m
S might be evolving over time.
Naïve Dictionary Implementations (1): Bit Vectors

An array A, keeping one bit (0/1) for each element of U
Insert element i => A[i] = 1
Delete element i => A[i] = 0
Lookup element i => return A[i] == 1
Time complexity of every operation: O(1)
Space: O(|U|) (e.g., 2^32 bits for IP addresses)
Quick but Not Scalable!
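The three operations above can be sketched in a few lines. This is a minimal illustration (using a small 2^8 universe to keep the example tiny; the class name and sizes are ours, not the lecture's):

```python
class BitVectorDict:
    """Dictionary over a fixed universe {0, ..., universe_size - 1}.

    Every operation is O(1), but space is O(|U|)."""

    def __init__(self, universe_size):
        # One entry per element of U (a byte per element here, for simplicity).
        self.bits = bytearray(universe_size)

    def insert(self, i):
        self.bits[i] = 1

    def delete(self, i):
        self.bits[i] = 0

    def lookup(self, i):
        return self.bits[i] == 1

d = BitVectorDict(2 ** 8)
d.insert(42)
print(d.lookup(42))  # True
d.delete(42)
print(d.lookup(42))  # False
```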
Naïve Dictionary Implementations (2): Linked List

One entry for each element in S
Insert element i => check if i exists; if not, append to the list
Delete element i => find i in the list and remove it
Lookup element i => go through the entire list
Time complexity of every operation: O(|S|)
Space: O(|S|)
Scalable but Not Quick!
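A sketch of the list option, with a Python list standing in for the linked chain (the class name is ours); every operation scans, so all three are O(|S|):

```python
class ListDict:
    """Dictionary as an unsorted list: O(|S|) space, O(|S|) per operation."""

    def __init__(self):
        self.items = []  # the elements of S, in insertion order

    def insert(self, i):
        if i not in self.items:   # O(|S|) duplicate check
            self.items.append(i)

    def delete(self, i):
        if i in self.items:
            self.items.remove(i)  # O(|S|) scan

    def lookup(self, i):
        return i in self.items    # O(|S|) scan
```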
Hash Tables: Best of Both Worlds

A randomized dictionary that is:
Quick: O(1) expected time for each operation
Scalable: O(|S|) space
Hash Tables: High-level Idea

Buckets: distinct locations in the hash table
Let n be the # of buckets
n ≈ m (recall m = |S|)
i.e., load factor m/n = O(1)
Hash function h: U -> {0, 1, …, n-1}
We store each element x in bucket h(x)
(Notation throughout: U = universe, m = |S|, n = # buckets)
[Figure: h maps elements of the universe U into buckets 0, 1, …, n-1.]
Collisions

A collision: multiple elements hash to the same bucket.
Suppose we are about to insert a new element x and bucket h(x) is already occupied.
Resolving collisions:
Chaining: keep a linked list per bucket; append x to the list at h(x)
Open addressing: if h(x) is occupied, deterministically probe for another empty bucket; saves space (no pointers)
Chaining

[Figure: a sequence of insertions into a chained hash table. e3 hashes to bucket 1; e7 to bucket n-2; e5 to bucket n-1; then e1 and e4 also hash to bucket 1, so bucket 1's chain grows to e3 -> e1 -> e4. Empty buckets hold Null.]
Operations (With Chaining)

Insert(x): go to bucket h(x); if x is not in the list, append it.
Delete(x): go to bucket h(x); if x is in the list, delete it.
Lookup(x): go to bucket h(x); return true iff x is in the list.
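The three chained operations can be sketched as follows. This is an illustrative toy, not the lecture's construction: the hash function is Python's built-in `hash` reduced mod n, a stand-in for the universal families discussed later.

```python
class ChainedHashTable:
    """Hash table with chaining: one list ("chain") per bucket."""

    def __init__(self, n):
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _h(self, x):
        # Stand-in hash function; the lecture's point is that this
        # choice determines the chain lengths.
        return hash(x) % self.n

    def insert(self, x):
        chain = self.buckets[self._h(x)]
        if x not in chain:        # O(chain length), no duplicates
            chain.append(x)

    def delete(self, x):
        chain = self.buckets[self._h(x)]
        if x in chain:
            chain.remove(x)       # O(chain length)

    def lookup(self, x):
        return x in self.buckets[self._h(x)]  # O(chain length)
```

All three operations cost O(length of the chain at h(x)), which is exactly the quantity the next slides analyze.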
Running Time of Operations

Assume evaluating the hash function takes constant time.
(This may not be true for all hash functions.)
Consider an element x:
Insert: O(length of the linked list at bucket h(x))
Delete: O(length of the linked list at bucket h(x))
Lookup: O(length of the linked list at bucket h(x))
Worst & Best Scenarios

Let m be the # of elements in the hash table.
Worst case: O(m) per operation (everything in one bucket)
Best case: O(1) per operation
The linked-list lengths depend on the quality of the hash function!
Fundamental question: how can we choose "good" hash functions?
Bad Hash Functions

Recall our IP-address example: 32-bit addresses.
# buckets n = 2^8
Idea: use the most significant 8 bits.
Problem: strong correlations with the geography of how IP addresses are assigned; e.g., 171 or 172 as the first 8 bits is common.
Lots of addresses would get mapped to the same bucket.
In practice, be very careful when picking hash functions!
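The flawed scheme above is easy to sketch: hashing an IPv4 address by its top 8 bits sends every address sharing a first octet to the same bucket (the specific addresses below are made-up examples):

```python
def bad_hash(ip):
    """Map a dotted IPv4 string to one of 2^8 buckets using only the
    most significant 8 bits -- the flawed scheme from the slide."""
    octets = [int(p) for p in ip.split(".")]
    value = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
    return value >> 24  # the top 8 bits are just the first octet

# Every address beginning with 172 lands in the same bucket:
print(bad_hash("172.16.0.1"))    # 172
print(bad_hash("172.217.4.46"))  # 172
```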
Is There A Single Good Hash Function?

Idea: design a clever hash function h that spreads every data set evenly across the buckets.
Problem: such a function cannot exist!
Recall |U| >> m ≈ n. By the pigeonhole principle, for any fixed h there is a bucket i such that at least |U|/n elements of U map to i.
If S is drawn entirely from those elements, every operation costs O(m)!
No Single Good Hash Function!

Claim: for every single hash function h, there is a pathological data set!
Proof: by the pigeonhole principle.
Solution: Pick a Hash Function Randomly

Design a set, or "family," H of hash functions such that for every data set S, if we pick an h ∈ H at random, then almost always we spread S out evenly across the buckets.
Question: why couldn't you put the randomness inside a single hash function instead?
Clarification on Proposed Analysis

We pick h randomly from H and analyze the expected performance on an arbitrary but fixed input S.
[Figure: the same fixed input S, hashed with independently drawn functions h1, h2, …, ht, yields performances 1 through t; the expectation is over these random draws of h, not over inputs.]
Roadmap

1. Define what it means for H to be "universal."
2. Show that if H is universal and we pick h ∈ H randomly, then our hash table has O(1) expected cost per operation.
3. Show that simple and practical universal families H exist.
1. Universal Family of Hash Functions

Let H be a set of functions from U to {0, 1, …, n-1}.
Definition: H is universal if for all x, y ∈ U with x ≠ y, when h is chosen uniformly at random from H:
Pr(h(x) = h(y)) ≤ 1/n
I.e., the fraction of hash functions in H that make x and y collide is at most 1/n.
Why 1/n? "As if we were mapping x and y to buckets independently and uniformly at random."
2. Universality => Operations Are O(1)

Let H be a universal family of hash functions from U to {0, 1, …, n-1}, and recall m = O(n).
Claim: if h is picked randomly from H, then for any data set S, hash table operations run in O(1) expected time.
2. Universality => Operations Are O(1)

Proof: pick h randomly from H and insert the elements of S. A new element x arrives; say we want to perform Lookup(x).
Cost: O(# elements in bucket h(x)). This quantity is a random variable; call it Z.

Proof continued: Z = # elements in bucket h(x).
For each element y ∈ S with y ≠ x, let Xy be 1 if h(y) = h(x), and 0 otherwise. Then
Z ≤ 1 + Σ_{y ∈ S, y ≠ x} Xy
(the leading 1 covers the case that x itself is already in the table).
By linearity of expectation and universality:
E[Z] ≤ 1 + Σ_{y ∈ S, y ≠ x} Pr(h(y) = h(x)) ≤ 1 + m/n = O(1).
Q.E.D.
3. Universal Families of HF Exist (1)

Let n = 2^b and |U| = 2^t, with t > b.
Represent each x ∈ U as a t-bit binary vector.
Example: |U| = 2^7 = 128, and the hash table has size 2^4 = 16.
Pick a random 0/1 matrix M with b rows and t columns, and define h(x) = Mx, where the matrix-vector multiplication is done mod 2.
[Figure: a random 4 x 7 0/1 matrix M times the 7-bit vector for element 52 equals the 4-bit vector for bucket 12, mod 2.]
3. Universal Families of HF Exist (2)

h(x) = Mx maps {0,1}^t to {0,1}^b, i.e., U -> {0, 1, …, n-1}.
H = the set of all possible b x t 0/1 matrices M.
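The random-matrix family can be sketched directly (the parameters b = 4, t = 7 below match the slide's example; the function names are ours):

```python
import random

def random_matrix(b, t, rng):
    """Draw h from H: a b x t matrix of independent fair coin flips."""
    return [[rng.randint(0, 1) for _ in range(t)] for _ in range(b)]

def matrix_hash(M, x, t):
    """h(x) = Mx mod 2, reading x as a t-bit vector (most significant bit first)."""
    bits = [(x >> (t - 1 - j)) & 1 for j in range(t)]
    out = 0
    for row in M:
        dot = sum(r * z for r, z in zip(row, bits)) % 2  # one output bit
        out = (out << 1) | dot
    return out  # a bucket index in {0, ..., 2^b - 1}

rng = random.Random(0)
M = random_matrix(b=4, t=7, rng=rng)
print(matrix_hash(M, 52, t=7))  # some bucket in {0, ..., 15}
```

Note the map is linear over GF(2): h(x XOR y) = h(x) XOR h(y), which is exactly why the collision analysis below can reduce Mx = My to Mz = 0 for z = x XOR y.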
Proof that H is Universal (1)

We need to prove that for all x ≠ y, Pr(h(x) = h(y)) ≤ 1/n = 1/2^b when M is picked uniformly at random from H — equivalently, when each cell of M is an independent fair coin flip.
Proof that H is Universal (2)

x and y differ in at least one bit; w.l.o.g. say they differ in the last bit.
Let z = x - y (mod 2), so z's last bit is 1. Note that Mx = My iff Mz = 0.
Q: Pr(Mz = 0)?
Proof that H is Universal (3)

Pr(Mz = 0) = Pr(Mz[1] = 0 & Mz[2] = 0 & … & Mz[b] = 0)
The events Mz[i] = 0 and Mz[j] = 0 are independent for i ≠ j, since the coin flips for row i of M are independent of the coin flips for row j.
So Pr(Mz = 0) = Pr(Mz[1] = 0) · Pr(Mz[2] = 0) · … · Pr(Mz[b] = 0)
Q: Pr(Mz[i] = 0)?
Proof that H is Universal (4)

Mz[i] = m_i1·z1 + m_i2·z2 + … + m_it·1 (mod 2), since z's last bit is 1.
Let s be the (mod 2) sum of the first t-1 terms. Then Mz[i] = s + m_it (mod 2), so Mz[i] = 0 iff m_it = s.
Proof that H is Universal (5)

Irrespective of the first t-1 coin flips in row i, everything depends on the last coin flip: m_it equals s with probability exactly 1/2.
Hence Pr(Mz[i] = 0) = 1/2, and
Pr(Mx = My) = Pr(Mz = 0) = (1/2)^b = 1/n.
Q.E.D.
Storing and Evaluating Hash Function h (M)

Q: How much space do we need to store the random matrix M?
A: bt bits = O(log n · log|U|)
Q: How much time to evaluate Mz?
A: Naïvely, bt = O(log n · log|U|) bit operations.
Summary: H is a relatively fast and practical universal family of hash functions.
Another Possible Family

We're hashing from U to {0, 1, …, n-1}.
Let H be the set of all such functions.
Question: is H universal?
Another Possible Family

Fix x ≠ y and a bucket j. We're hashing from U to {0, 1, …, n-1}.
Q1: # of such functions? A1: n^|U|
Q2: # of functions in which h(x) = h(y) = j? A2: n^(|U|-2)
Q3: # of functions in which h(x) = h(y)? A3: n · n^(|U|-2) = n^(|U|-1)
Q4: Pr(h(x) = h(y))? A4: n^(|U|-1) / n^|U| = 1/n => H is universal!
Why is H Impractical?

There are n^|U| functions in H.
What's the cost of storing a function h from H?
log|H| = O(|U| log n) bits.
Not practical!
Summary

1. Hash tables
2. Defined universal families of hash functions
3. Universal family => hash table ops run in O(1) expected time
4. Universal families exist
Outline For Today
1. Hash Tables and Universal Hashing
2. Bloom Filters
Bloom Filters: A Randomized Data Structure

Implements a limited version of the Dictionary ADT: Insert and Lookup only.
Compared to hash tables:
Applications: website caches for ISPs, among others
Cons:
No deletes
Not always a correct output to Lookup(x) => false positives
Pros:
More space efficient (stores no pointers to the actual objects inserted)
Same Setup As Hash Tables

Universe U of all possible elements (e.g., all possible 2^32 IP addresses)
Maintain a subset S ⊆ U, with |S| = m and |U| >> m
Bloom Filters

A Bloom filter consists of:
A bit array of size n, initially all 0 (individual bits, not buckets)
k hash functions h1, …, hk
Space cost per element: n/m bits
[Figure: a length-16 bit array, initially all zeros.]
Insertions

Insert(a): set bits h1(a), …, hk(a) to 1 => O(k) time.
Example with k = 3, starting from the all-zero array:
Insert x, with h1(x)=2, h2(x)=9, h3(x)=0: bits 0, 2, 9 become 1.
Insert y, with h1(y)=1, h2(y)=5, h3(y)=9: bits 1 and 5 also become 1.
Insert z, with h1(z)=10, h2(z)=11, h3(z)=5: bits 10 and 11 also become 1.
The array is now 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0.
Do you see why there would be false positives?
Lookup

Lookup(a): return true iff bits h1(a), …, hk(a) are all 1 => O(k) time.
With the array 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 (bit positions 0 through 15):
x: h1(x)=2, h2(x)=9, h3(x)=0 => Lookup(x) = true
z: h1(z)=3, h2(z)=9, h3(z)=4 => Lookup(z) = false (bits 3 and 4 are 0)
t: h1(t)=0, h2(t)=1, h3(t)=2 => Lookup(t) = true, even though t was never inserted: a false positive!
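Insert and Lookup together can be sketched as below. The k hash functions are derived from salted SHA-256 digests, an illustrative stand-in (the lecture does not specify how the hi are built):

```python
import hashlib

class BloomFilter:
    """Sketch of a Bloom filter with an n-bit array and k hash functions."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _hashes(self, item):
        # Derive k positions in {0, ..., n-1} from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, item):
        for j in self._hashes(item):  # O(k): set k bits
            self.bits[j] = 1

    def lookup(self, item):
        # O(k): true iff all k bits are set. No false negatives;
        # false positives possible when other insertions set these bits.
        return all(self.bits[j] == 1 for j in self._hashes(item))

bf = BloomFilter(n=128, k=3)
bf.insert("10.0.0.1")
print(bf.lookup("10.0.0.1"))  # True -- inserted elements are never missed
```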
Can Bloom Filters Be Useful?

Can Bloom filters be both space efficient and have a low false-positive rate?
What is the probability of a false positive, as a function of n, m, and k?
Probability of False Positive

We have inserted m elements into the Bloom filter.
A new element z arrives that has not been inserted before.
Q: What is Pr(false positive for z)?
Say h1(z) = j1, …, hk(z) = jk.
Simplifying (unjustified) assumption: all hashing is totally random!
For every hi and every x, hi(x) is uniform over {0, 1, …, n-1} and independent of all other hash values hj(y).
Warning: this is only to simplify the analysis; it won't hold in practice.
Pr(bit j is 1 after m insertions)?

Consider a particular bit j in the array.
Q1: Fix hi and an element x. Pr(hi(x) turns bit j to 1)? A1: 1/n
Q2: Pr(x turns bit j to 1), i.e., Pr(at least one of h1(x), …, hk(x) = j)?
A2: 1 - Pr(x does not turn j to 1) = 1 - (1 - 1/n)^k
Q3: Pr(bit j = 1 after m insertions)?
A3: 1 - Pr(no element turns j to 1) = 1 - (1 - 1/n)^(km)
Pr(false positive for x)?

Recall that for x we check k bits: h1(x) = j1, …, hk(x) = jk.
Pr(bit ji = 1) = 1 - (1 - 1/n)^(km)
Pr(false positive) = Pr(all ji = 1) = (1 - (1 - 1/n)^(km))^k
Recall the calculus fact: 1 + x ≤ e^x; moreover, around x = 0, 1 + x ≈ e^x.
Therefore: Pr(false positive) ≈ (1 - e^(-km/n))^k
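The exact expression and its exponential approximation can be compared numerically; this sketch uses illustrative parameters of our choosing (m = 1000 elements, n = 8000 bits, k = 5 hash functions):

```python
import math

def fp_exact(n, m, k):
    """False-positive estimate from the slide: (1 - (1 - 1/n)^(km))^k."""
    return (1 - (1 - 1 / n) ** (k * m)) ** k

def fp_approx(n, m, k):
    """Exponential approximation: (1 - e^(-km/n))^k."""
    return (1 - math.exp(-k * m / n)) ** k

# With 8 bits per element and k = 5, both forms give roughly a 2% rate.
print(round(fp_exact(8000, 1000, 5), 4))
print(round(fp_approx(8000, 1000, 5), 4))
```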
How Does Failure Rate Change With k, n?

Observation 1: as n increases, the failure rate decreases.
Observation 2: as k (the # of hash functions) increases:
more bits to check => less likely to fail
more bits set per object => more likely to fail
so it is unclear whether the rate increases or decreases.
Question: what's the optimal k for fixed n/m?
Answer (by taking derivatives): k = ln(2)·(n/m) ≈ 0.69·(n/m)
At this k, e^(-km/n) = 1/2, so the failure rate is (1/2)^k.
How Does Failure Rate Change With k, n?

For fixed n/m, with optimal k = ln(2)·(n/m), the failure rate is
(1/2)^(ln(2)·n/m) ≈ (0.6185)^(n/m).
Already at n = 8m, the rate is about 1-2%.
It decreases exponentially with n/m.
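The optimal-k formula and the resulting rate are easy to tabulate (a small sketch; the n/m ratios chosen are just examples):

```python
import math

def optimal_k(bits_per_element):
    """Optimal # of hash functions for a given n/m: k = ln(2) * (n/m)."""
    return math.log(2) * bits_per_element

def failure_rate(bits_per_element):
    """False-positive rate at the optimal k: (1/2)^k = (0.6185...)^(n/m)."""
    return 0.5 ** optimal_k(bits_per_element)

# n/m = 8 gives k about 5.5 and a rate of roughly 2%, matching the slide;
# doubling n/m to 16 squares the rate.
for ratio in (4, 8, 16):
    print(ratio, round(optimal_k(ratio), 2), round(failure_rate(ratio), 4))
```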
Next Week: Dynamic Programming