Bloom filters

Ben Langmead

Department of Computer Science

Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]).

115 bloom filters pub - Department of Computer Sciencelangmea/resources/lecture_notes/115_bloom_filter… · Bloom !lter Say !lter has bits and uses hash functions n k What's the




Bloom filter

Imagine we start with a hash table ...

... and start taking things away

Take away values

Now we have a hash set

Now take away keys

[Figure: hash table diagram with legend: Pointer, Null Pointer, Key, Value]


Bloom filter

10000010010000001000


Bloom filter: inserting

00000000000000000000

Start with all 0s


Bloom filter: inserting

00010000000000000000

h1(x1)

To insert an item, hash it to a bucket and set that bit to 1 (if it is not already 1)


Bloom filter: inserting

00010001000010000000

h1(x1)

h1(x2)

h1(x3)


Bloom filter: inserting

00010001000010000000

h1(x1)

h1(x2)

h1(x3)

h1(x4)

If already 1, this is a collision

Life goes on
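A minimal sketch of single-hash insertion, assuming a 20-bit filter like the one pictured. The function `h1` is a hypothetical stand-in built from SHA-256 (the slides do not say which hash is used), and the item names are placeholders.

```python
import hashlib

N_BITS = 20  # same width as the 20-bit filter pictured on the slides


def h1(item: str, n_bits: int = N_BITS) -> int:
    """Hypothetical stand-in for h1: map an item to a bucket in [0, n_bits)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def insert(bits: list, item: str) -> bool:
    """Set the item's bit; return True if it was already 1 (a collision)."""
    idx = h1(item, len(bits))
    collided = bits[idx] == 1
    bits[idx] = 1  # setting an already-set bit changes nothing
    return collided


bits = [0] * N_BITS  # start with all 0s
for item in ["x1", "x2", "x3", "x4"]:
    collided = insert(bits, item)
    print(item, "->", h1(item), "collision" if collided else "set")
print("".join(map(str, bits)))
```

`insert` reports whether the bit was already 1, which is the collision case; the filter's contents are the same either way.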


Bloom filter: querying

01010001011010110101

Use the same hash function h1 to hash the query item; check if its bucket is 0 or 1


Bloom filter: querying

01010001011010110101

h1(y)

If filter says "0" -- item is definitely not present


Bloom filter: querying

01010001011010110101

h1(z)

If filter says "0" -- item is definitely not present

If filter says "1" -- item may be present ...or may be a collision

One-sided error

Randomized algorithms can exploit one-sided error via repeated trials
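The one-sided error can be checked directly. This sketch again assumes a SHA-256-based stand-in for h1 and made-up item names; it is an illustration, not the hash used in the lecture.

```python
import hashlib


def h1(item: str, n_bits: int) -> int:
    """Hypothetical stand-in for h1: map an item to a bucket in [0, n_bits)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def build(items, n_bits: int) -> list:
    """Insert every item into a fresh single-hash filter."""
    bits = [0] * n_bits
    for item in items:
        bits[h1(item, n_bits)] = 1
    return bits


def query(bits: list, item: str) -> bool:
    """False: definitely not present. True: present, or a collision."""
    return bits[h1(item, len(bits))] == 1


inserted = [f"in-{i}" for i in range(8)]
bits = build(inserted, n_bits=20)

# No false negatives: every inserted item must answer True.
assert all(query(bits, item) for item in inserted)

# Items never inserted may still answer True because of collisions.
absent = [f"out-{i}" for i in range(1000)]
false_positives = sum(query(bits, item) for item in absent)
print(f"{false_positives} of {len(absent)} absent items reported as 'maybe present'")
```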


Bloom filter

What if we used many hashes/filters; adding each item to each?

h1: 01010001011010110101

h2: 00010001010011100100

h3: 01110011011010110101

...

hk: 01111100010100111111

Will all filters have the same number of set (=1) bits?

Not necessarily; one hash/filter might have more/fewer collisions than another


Bloom filter

01010001011010110101

00010001010011100100

01110011011010110101

01111100010100111111

... h{1,2,3,...,k}(y)

If any query yields 0, item is not in the set


Bloom filter

01010001011010110101

00010001010011100100

01110011011010110101

01111100010100111111

... h{1,2,3,...,k}(z)

If all queries yield 1, item may be in the set; or we might have collided k times
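The "one filter per hash" arrangement can be sketched as follows. The salted SHA-256 function `h` is an assumption made for illustration (any k well-behaved hashes would do), and `MultiFilter` is a hypothetical name.

```python
import hashlib


def h(i: int, item: str, n_bits: int) -> int:
    """Hypothetical i-th hash: SHA-256 of the item salted with i."""
    digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


class MultiFilter:
    """k separate bit arrays; filter i is indexed only by hash i."""

    def __init__(self, k: int, n_bits: int):
        self.k, self.n_bits = k, n_bits
        self.filters = [[0] * n_bits for _ in range(k)]

    def insert(self, item: str) -> None:
        # Add the item to every filter, using that filter's own hash.
        for i, bits in enumerate(self.filters):
            bits[h(i, item, self.n_bits)] = 1

    def query(self, item: str) -> bool:
        # Any 0 proves absence; all 1s means present or k collisions.
        return all(bits[h(i, item, self.n_bits)] == 1
                   for i, bits in enumerate(self.filters))


mf = MultiFilter(k=4, n_bits=20)
mf.insert("y")
print(mf.query("y"))  # an inserted item always answers True
print(mf.query("z"))  # False unless "z" collides in all 4 filters
```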


Bloom filter

01111100010100111111

$p_1$: fraction of bits set

$k$: number of hash functions

Say item is not present; chance that $k$ filters all return 1 is ... $p_1^k$

...increasing $k$ reduces error exponentially

Say $p_1 = 50\%$ for all filters; 10 filters give collective error rate of $\left(\tfrac{1}{2}\right)^{10} < 0.1\%$
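The arithmetic is easy to confirm. The helper name `collective_error` is made up for this sketch, and it assumes the k filters behave independently.

```python
def collective_error(p1: float, k: int) -> float:
    """Chance that k independent filters, each with a fraction p1 of bits set, all return 1."""
    return p1 ** k


for k in (1, 2, 5, 10):
    print(f"k={k:>2}  error={collective_error(0.5, k):.6f}")
```

With $p_1 = 0.5$ and $k = 10$ the result is $1/1024$, a little under 0.1%.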


Bloom filter

Rather than use a new filter for each hash, one filter can use $k$ hashes

00000000000000000000


Bloom filter

Rather than use a new filter for each hash, one filter can use $k$ hashes ($k = 4$ here)

01001000000100001000

Insert h{1,2,3,4}(x): set, set, set, set


Bloom filter

Rather than use a new filter for each hash, one filter can use $k$ hashes ($k = 4$ here)

01001000000100001000

Insert h{1,2,3,4}(x): set, set, set, set

Query h{1,2,3,4}(y): ?, ?, ?, ?
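A minimal single-array Bloom filter along these lines might look like the sketch below. The `BloomFilter` class and its salted SHA-256 hashing are assumptions for illustration; the bit array uses one byte per bit for readability rather than compactness.

```python
import hashlib


class BloomFilter:
    """One bit array shared by k hash functions."""

    def __init__(self, n_bits: int, k: int):
        self.n_bits, self.k = n_bits, k
        self.bits = bytearray(n_bits)  # one byte per bit, all 0 initially

    def _positions(self, item: str):
        # Hypothetical hash family: SHA-256 of the item salted with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: str) -> bool:
        # True only if all k probed bits are 1.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(n_bits=20, k=4)
bf.insert("x")
print("".join(map(str, bf.bits)))
print(bf.query("x"))  # an inserted item always answers True
```

Sharing one array means different hash functions can collide with each other as well as with themselves, which is why an insert may set fewer than $k$ new bits.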


Bloom filter

Say filter has $n$ bits and uses $k$ hash functions $h_{\{1,2,3,\ldots,k\}}$

Assume hashes are well behaved, i.e. uniform and independent

What's the probability a given bit is 0 after $m$ items have been inserted?

$$\left(1 - \frac{1}{n}\right)^{mk}$$
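The formula can be sanity-checked by simulation. In this sketch, uniformly random bit positions stand in for well-behaved hashes (that is the slide's assumption, not a real hash function), and the sizes are arbitrary illustrative values.

```python
import random


def p0_exact(n: int, m: int, k: int) -> float:
    """Probability a given bit is still 0 after m items, k hashes each."""
    return (1 - 1 / n) ** (m * k)


def p0_simulated(n: int, m: int, k: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate the same probability by watching bit 0 over many random fills."""
    rng = random.Random(seed)
    still_zero = 0
    for _ in range(trials):
        # m*k uniform, independent positions stand in for well-behaved hashes
        positions = [rng.randrange(n) for _ in range(m * k)]
        if 0 not in positions:
            still_zero += 1
    return still_zero / trials


n, m, k = 100, 10, 7
print(f"formula:    {p0_exact(n, m, k):.4f}")
print(f"simulation: {p0_simulated(n, m, k):.4f}")
```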


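Bloom filter

[Figure: hash functions $h_{\{1,2,3,\ldots,k\}}$ pointing into a filter of $n$ bits]

Probability a given bit is 0 after $m$ items are inserted

$$p_0 = \left(1 - \frac{1}{n}\right)^{mk}$$

$$p_1 = 1 - p_0$$

What's the probability of a false positive? Where all $k$ hash functions find a 1?

$$p_1^k = (1 - p_0)^k = \left(1 - \left(1 - \frac{1}{n}\right)^{mk}\right)^k$$

Call this $f$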
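Written as code, $f$ is a three-line function. The function name and the example sizes (1000 bits, 100 items) are illustrative choices, not values from the slides.

```python
def false_positive_rate(n: int, m: int, k: int) -> float:
    """f = (1 - (1 - 1/n)^(mk))^k, with n bits, m items, k hashes."""
    p0 = (1 - 1 / n) ** (m * k)  # a given bit is still 0
    p1 = 1 - p0                  # a given bit is 1
    return p1 ** k               # all k probed bits are 1


for k in (1, 4, 7, 10):
    print(f"k={k:>2}  f={false_positive_rate(n=1000, m=100, k=k):.5f}")
```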


Bloom analysis: road ahead

01001000000100001000

h{1,2,3,4}(x)

$$p_0 = \left(1 - \frac{1}{n}\right)^{mk}$$

$$f = \left(1 - \left(1 - \frac{1}{n}\right)^{mk}\right)^k = (1 - p_0)^k = p_1^k$$

1. Approximations in terms of $e$

2. Choosing $k$ (# hashes) for fixed $m, n$

3. Choosing $m/n$

4. Analyzing fullness with Balls & Bins


Bloom filter

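Rewrite $\left(1 - \frac{1}{n}\right)^{mk}$ in terms of $e$:

$$\left(1 - \frac{1}{n}\right)^{mk} = e^{\ln\left[\left(1 - \frac{1}{n}\right)^{mk}\right]} = e^{\ln\left(1 - \frac{1}{n}\right) mk}$$

Can we approximate $\ln\left(1 - \frac{1}{n}\right)$?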


Bloom filter

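Taylor expansion of $\ln(1 + x)$ ("Mercator series"):

$$x - \frac{x^2}{2} + \frac{x^3}{3} - \ldots$$

If $x$ is small (e.g. 0.001), terms beyond first are really small (<1 millionth)

$$\ln\left(1 - \frac{1}{n}\right) \approx -\frac{1}{n}$$

$$e^{\ln\left(1 - \frac{1}{n}\right) mk} \approx e^{-\frac{mk}{n}}$$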


Bloom filter

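$$p_0 = \left(1 - \frac{1}{n}\right)^{mk} \approx e^{-\frac{mk}{n}} = \tilde{p}_0$$

$$f = (1 - p_0)^k \approx \left(1 - e^{-\frac{mk}{n}}\right)^k = (1 - \tilde{p}_0)^k$$

Approximation using Taylor expansion of $\ln(1 + x)$

$e^{-\frac{m}{n} \cdot k}$, where $\frac{m}{n}$ = # data items per slot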
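How good is the approximation? The sketch below compares $p_0$ with $\tilde{p}_0$ for a few filter sizes, taking $x = -1/n$ in the Taylor expansion. The sizes and the fixed ratio $n/m = 10$ are arbitrary illustrative choices.

```python
import math


def p0_exact(n: int, m: int, k: int) -> float:
    """Exact probability that a given bit is 0: (1 - 1/n)^(mk)."""
    return (1 - 1 / n) ** (m * k)


def p0_approx(n: int, m: int, k: int) -> float:
    """Approximation e^(-mk/n) from the first Taylor term of ln(1 + x)."""
    return math.exp(-m * k / n)


k = 7
for n in (10, 100, 10_000):
    m = n // 10  # keep n/m = 10 bits per item
    exact, approx = p0_exact(n, m, k), p0_approx(n, m, k)
    print(f"n={n:>6}  exact={exact:.6f}  approx={approx:.6f}  gap={approx - exact:.2e}")
```

The gap shrinks as $n$ grows, since the dropped terms are powers of $1/n$.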


Bloom filter

How do we choose # of hash functions $k$?

Can we have too many?

Yes, quickly clogging the filter with 1s ($p_1^k$: $p_1$ is too close to 1)

Too few?

Yes, failing to get much exponential decrease in error ($p_1^k$: small $k$, not much compounding)

[Figure: $m$ items inserted into a filter of $n$ bits]
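Both failure modes show up in a small table. This sketch uses the approximation $\tilde{p}_0 = e^{-mk/n}$ and assumes $n/m = 10$ bits per item; the function name is made up for illustration.

```python
import math


def bits_set_and_fpr(k: int, bits_per_item: float):
    """Return (p1, f) using the approximation p0 = e^(-mk/n), n/m = bits_per_item."""
    p1 = 1 - math.exp(-k / bits_per_item)
    return p1, p1 ** k


for k in (1, 2, 7, 20, 40):
    p1, f = bits_set_and_fpr(k, bits_per_item=10)
    print(f"k={k:>2}  fraction of bits set={p1:.3f}  f={f:.4f}")
```

At small $k$ few bits are set but there is little compounding; at large $k$ nearly every bit is set, so each probe is almost certain to find a 1.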


Bloom filter

We know that f.p. rate gets large for small $k$ and large $k$. To find minimum, find where derivative is 0

$$f \approx \left(1 - e^{-\frac{m}{n} \cdot k}\right)^k$$

[Figure: sketch of $f$ versus $k$ for fixed $\frac{m}{n}$]


Bloom filter

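$\frac{d}{dk}\left(1 - e^{-\frac{mk}{n}}\right)^k$ is tricky, so log the function first

$$\frac{d}{dk}\left[k \ln\left(1 - e^{-\frac{mk}{n}}\right)\right]$$

At top level, use product rule

$$\underbrace{\frac{d}{dk}[k] \cdot \ln\left(1 - e^{-\frac{mk}{n}}\right)}_{=\ \ln\left(1 - e^{-\frac{mk}{n}}\right)} + \underbrace{k \cdot \frac{d}{dk}\left[\ln\left(1 - e^{-\frac{mk}{n}}\right)\right]}_{\text{Need chain rule in here}}$$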
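The chain-rule step for the second term can be written out as follows. This is a worked version of the step the slide flags, using only the symbols already defined.

```latex
\frac{d}{dk}\left[\ln\left(1 - e^{-\frac{mk}{n}}\right)\right]
  = \frac{1}{1 - e^{-\frac{mk}{n}}} \cdot \frac{d}{dk}\left[1 - e^{-\frac{mk}{n}}\right]
  = \frac{1}{1 - e^{-\frac{mk}{n}}} \cdot \frac{m}{n}\, e^{-\frac{mk}{n}}
```

Multiplying by $k$ gives the second term of the full derivative, $\frac{km}{n} \cdot \frac{e^{-km/n}}{1 - e^{-km/n}}$.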


Bloom filter

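$$\frac{d}{dk}\left[k \ln\left(1 - e^{-\frac{mk}{n}}\right)\right] = \ln\left(1 - e^{-km/n}\right) + \frac{km}{n} \cdot \frac{e^{-km/n}}{1 - e^{-km/n}}$$

Derivative is zero when $k = \ln 2 \cdot n/m$

$$e^{-km/n} = e^{\ln 2 \cdot n/m \cdot -m/n} = e^{-\ln 2} = \frac{1}{2}$$

$$\frac{km}{n} = \ln 2 \cdot n/m \cdot m/n = \ln 2$$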


Bloom filter

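$$\frac{d}{dk}\left[k \ln\left(1 - e^{-\frac{mk}{n}}\right)\right] = \ln\left(1 - e^{-km/n}\right) + \frac{km}{n} \cdot \frac{e^{-km/n}}{1 - e^{-km/n}}$$

Derivative is zero when $k = \ln 2 \cdot n/m$, where $e^{-km/n} = \frac{1}{2}$ and $\frac{km}{n} = \ln 2$:

$$= \ln(1 - 1/2) + \ln 2 \cdot \frac{1/2}{1 - 1/2} = \ln 1/2 + \ln 2 = -\ln 2 + \ln 2 = 0$$

Best choice of $k$: $k^* = \ln 2 \cdot n/m$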
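A quick numerical check of the zero, evaluating the slide's derivative expression just below, at, and just above $k^*$. The sizes `n = 1000` and `m = 100` are illustrative assumptions.

```python
import math


def d_log_f(k: float, n: int, m: int) -> float:
    """d/dk [k ln(1 - e^(-km/n))], as given on the slide."""
    e = math.exp(-k * m / n)
    return math.log(1 - e) + (k * m / n) * e / (1 - e)


n, m = 1000, 100                 # illustrative sizes: 10 bits per item
k_star = math.log(2) * n / m

for k in (k_star - 1, k_star, k_star + 1):
    print(f"k = {k:.3f}  derivative = {d_log_f(k, n, m):+.6f}")
```

The derivative is negative below $k^*$ and positive above it, so the stationary point is a minimum of $\ln f$, and therefore of $f$.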


Bloom filter

[Figure: plot titled "Tradeoff for M/N=10"; axes labeled FPR and h]

$n/m = 10$

$$f \approx \left(1 - e^{-\frac{mk}{n}}\right)^k$$

$$k^* = \ln 2 \cdot 10 = 6.93$$
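Since $k$ must be a whole number in practice, one would pick an integer near $k^*$. This sketch evaluates the approximate false-positive rate at $n/m = 10$; the function name and the search range 1 to 20 are assumptions for illustration.

```python
import math


def fpr(k: float, bits_per_item: float = 10.0) -> float:
    """Approximate false-positive rate (1 - e^(-mk/n))^k with n/m = bits_per_item."""
    return (1 - math.exp(-k / bits_per_item)) ** k


k_star = math.log(2) * 10              # real-valued optimum from the slide
best_int = min(range(1, 21), key=fpr)  # best whole number of hashes in 1..20

print(f"k* = {k_star:.2f}, f(k*) = {fpr(k_star):.5f}")
for k in (best_int - 1, best_int, best_int + 1):
    print(f"k = {k}, f = {fpr(k):.5f}")
```

The best integer here is $k = 7$, with a false-positive rate of roughly 0.8%; the curve is fairly flat around the minimum, so $k = 6$ or $k = 8$ cost little.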


Bloom filter

01010101011010110101

If we pick ideal $k$ (# hashes) for fixed $m, n$, what fraction of the filter do we expect to be set bits?


Bloom filter

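With $k$ chosen optimally, filter is 50% set bits

$$e^{-mk/n} = \tilde{p}_0$$

$$-\frac{mk}{n} = \ln \tilde{p}_0$$

$$k = -\ln \tilde{p}_0 \cdot \frac{n}{m}$$

Best choice of $k$: $k^* = \ln 2 \cdot \frac{n}{m}$

$$-\ln \tilde{p}_0 \cdot \frac{n}{m} = \ln 2 \cdot \frac{n}{m}$$

$$\ln \tilde{p}_0 = -\ln 2$$

$$\tilde{p}_0 = \frac{1}{2}$$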
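The 50% claim can be checked by filling a simulated filter. As before, uniformly random positions stand in for well-behaved hashes, and the sizes are illustrative; because $k$ is rounded to a whole number, the result lands near 50% rather than exactly on it.

```python
import math
import random


def fraction_set(n: int, m: int, k: int, seed: int = 0) -> float:
    """Fill an n-bit filter with m items, k random positions each; return fraction of 1s."""
    rng = random.Random(seed)
    bits = bytearray(n)
    for _ in range(m * k):
        bits[rng.randrange(n)] = 1  # stand-in for one hash of one item
    return sum(bits) / n


n, m = 100_000, 10_000
k = round(math.log(2) * n / m)  # nearest whole number to k* = ln 2 * n/m
print(f"k = {k}, fraction of bits set = {fraction_set(n, m, k):.4f}")
```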