Bloom filters as a space-efficient dictionary structure
Szymon Grabowski, sgrabow@kis.p.lodz.pl
Instytut Informatyki Stosowanej, Politechnika Łódzka
April 2013
The idea of hashing
Hash table: stores keys (and possibly satellite data); the location of a key is found via a hash function, and some collision resolution method is needed.
[Figures: chained hashing; open addressing]
3
Hash structures are typically fast…
…but have a few drawbacks: they are randomized, don’t allow for iteration in sorted order, and may require quite some space.
Now, if we need as little space as possible, what can we do?
4
Bloom Filter (Bloom, 1970)
Don’t store the keys themselves!
Just be able to answer whether a given key is in the structure. If the answer is “no”, it is correct.
But if the answer is “yes”, it may be wrong!
So, it’s a probabilistic data structure.
There’s a tradeoff between its space (avg space per inserted key) and “truthfulness”.
5
Bloom Filter features
• little space, so save on RAM (mostly old apps);
• little space, so also fast to transfer the structure over a network;
• little space, sometimes small enough to fit in L2 (or L1!) CPU cache (Li & Zhong, 2006: BF makes a Bayesian spam filter work much faster thanks to fitting in an L2 cache);
• extremely simple / easy to implement;
• major application domains: databases, networking;
• …but also a few drawbacks / issues, hence a significant interest in devising novel BF variants
6
Bloom Filter idea
The idea:
• keep a bit-vector of some size m, initially all zeros;
• use k independent hash functions (h.f.) (instead of one, as in a standard HT) for each added key;
• write 1 in the k locations pointed to by the k h.f.;
• testing for a key: if all k calculated locations contain 1, then return “yes” (= the key exists), which may be wrong; if among the k locations there’s at least one 0, return “no”, which is always correct.
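The insert and test steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `BloomFilter`, string keys, and the way the k positions are derived (double hashing over two halves of a SHA-256 digest) are assumptions made here; the slides only ask for k independent hash functions.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: an m-bit vector and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # bit-vector, initially all zeros

    def _positions(self, key: str) -> list[int]:
        # Assumption: simulate k hash functions by double hashing
        # (h1 + i * h2) mod m over two 64-bit halves of a SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        # Write 1 in the k locations pointed to by the k hash functions.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def exists(self, key: str) -> bool:
        # Any 0 bit means a certain "no"; all 1s means "yes", possibly wrong.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Typical use would be `bf = BloomFilter(m=1024, k=5)`, then `bf.insert("apple")` and `bf.exists("apple")`; a key that was inserted is never reported as absent.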
7
BF, basic API
insert(k)
exists(k)
No delete(k)!
And of course: no iteration over the keys added to the BF (no content listing).
http://www.cl.cam.ac.uk/research/srg/opera/meetings/attachments/2008-10-14-BloomFiltersSurvey.pdf
8
Early applications – spellchecking (1982, 1990), hyphenation
If a spellchecker occasionally ignores a word not in its dictionary – not a big problem.
This is exactly the case with BF in this app.
Quite a good app: the dictionary is static (or almost static), so once we set the BF size, we can estimate the error, which practically doesn’t change.
App from Bloom’s paper: program for automatic hyphenation in which 90% of words can be hyphenated using simple rules,
but 10% require dictionary lookup.
9
Bloom speaking…
http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=7FB6933B782FBC9C98BBCDA0EB420935?doi=10.1.1.20.2080&rep=rep1&type=pdf
10
BF tradeoffs
The error grows with load (i.e. with growing n / m, where n is the # of added items).
When the BF is almost empty, the error is very small, but then we also waste lots of space.
Another factor: k. How to choose it?
For any specified load (m chosen for the ‘expected’ n in a given scenario) there is an optimal value of k (one that minimizes the error).
k too small – too many collisions; k too large – the bit vector gets too ‘dense’ quickly (and too many collisions, too!)
11
Finding the best k
We assume the hash functions choose each bit vector slot with equal probability.
Pr(a given bit NOT set by a given h.f.) = $1 - 1/m$
Pr(a given bit NOT set) = $(1 - 1/m)^{kn}$
m and kn are typically large, so Pr(a given bit NOT set) $\approx e^{-kn/m}$
Pr(a given bit is set) = $1 - (1 - 1/m)^{kn}$
Consider an element not added to the BF: the filter will lie if all the corresponding k bits are set.
This is: Pr(a given bit is set)$^k$ = $\left(1 - (1 - 1/m)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k$
12
Finding the best k, cont’d
Again, Pr(the Bloom filter lies) $\approx \left(1 - e^{-kn/m}\right)^k$.
Clearly, the error grows with growing n (for fixed k, m) and decreases with growing m (for fixed k, n).
What is the optimal k?
Differentiation (= calculating a derivative) helps.
The error is minimized for $k = \ln 2 \cdot m/n \approx 0.693\, m/n$. (Then the # of 1s and 0s in the bit-vector is approx. equal. Of course, k must be an integer!)
And the error (false positive rate, FPR) = $(1/2)^k \approx (0.6185)^{m/n}$.
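One way to carry out the minimization is sketched below. It substitutes $p = e^{-kn/m}$ (the probability that a given bit is still 0); this substitution is a common shortcut and an assumption here, not necessarily the route taken in the talk.

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^{k}, \qquad
p := e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\,\ln p, \\
\ln f &= k \,\ln(1 - p) = -\frac{m}{n}\,\ln p \,\ln(1 - p), \\
\frac{d}{dp}\bigl[\ln p \,\ln(1 - p)\bigr]
  &= \frac{\ln(1 - p)}{p} - \frac{\ln p}{1 - p} = 0
  \quad\text{for } 0 < p < 1 \text{ only at } p = \tfrac12, \\
k_{\mathrm{opt}} &= \frac{m}{n}\,\ln 2, \qquad
f(k_{\mathrm{opt}}) = \left(\tfrac12\right)^{k_{\mathrm{opt}}}
  = \left(2^{-\ln 2}\right)^{m/n} \approx 0.6185^{\,m/n}.
\end{align*}
```

Since $\ln p \,\ln(1-p)$ is largest at $p = 1/2$, $\ln f$ is smallest there, which is why half of the bits are set at the optimum.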
13
Minimizing the error in practice
m = 8n: error ≈ 0.0214
m = 12n: error ≈ 0.0031
m = 16n: error ≈ 0.0005
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html#SECTION00053000000000000000
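The error values for m = 8n, 12n and 16n can be reproduced from the approximation $(1 - e^{-kn/m})^k$ with $k = \ln 2 \cdot m/n$. The sketch below does that and also tries the nearest integer values of k; the function names are illustrative only.

```python
import math


def optimal_k(bits_per_key: float) -> float:
    """Real-valued k that minimizes (1 - e^{-k n/m})^k, i.e. ln 2 * m/n."""
    return math.log(2) * bits_per_key


def false_positive_rate(bits_per_key: float, k: float) -> float:
    """Approximate FPR (1 - e^{-k n/m})^k for c = m/n bits per key."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


for c in (8, 12, 16):
    k_real = optimal_k(c)
    # k must be an integer in practice: compare both neighbours of k_real.
    k_int = min((math.floor(k_real), math.ceil(k_real)),
                key=lambda k: false_positive_rate(c, k))
    print(f"m = {c}n: real k = {k_real:.2f}, "
          f"FPR = {false_positive_rate(c, k_real):.4f}; "
          f"integer k = {k_int}, FPR = {false_positive_rate(c, k_int):.4f}")
```

Rounding k to an integer costs very little: the integer-k error stays within a few percent of the real-valued optimum for these loads.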
14
FPR example, m / n = 8
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
15
Funny tricks with BF
Given two BFs, representing sets S1 and S2, with the same # of bits and using the same hash functions, we can represent the union of those sets by taking the OR of the two bit-vectors of the original BFs.
Say you want to halve the memory use after some time, and assume the filter size is a power of 2. Just OR the halves of the filter. When hashing for a lookup, drop the top bit of each computed position (i.e. take it modulo the new size).
Intersection of two BFs (of the same size), i.e. the AND operation, can be used to approximate the intersection of two sets.
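The union and halving tricks are easy to try on bit-vectors stored as Python integers. The sketch below is only an illustration under that representation; the helper names are made up here, and the lookup step is read as reducing each position modulo the new size.

```python
def union(bits_a: int, bits_b: int) -> int:
    """Filter for S1 ∪ S2: bitwise OR of two same-size, same-hash filters."""
    return bits_a | bits_b


def halve(bits: int, m: int) -> int:
    """Fold an m-bit filter (m a power of 2) into m/2 bits by OR-ing its halves."""
    half = m // 2
    low = bits & ((1 << half) - 1)   # lower half of the bit-vector
    high = bits >> half              # upper half, shifted down
    return low | high


def halved_position(pos: int, m: int) -> int:
    """Map a position computed for the m-bit filter into the halved filter."""
    return pos & (m // 2 - 1)        # same as pos % (m // 2) when m is a power of 2


# A key that set bits 3 and 12 in a 16-bit filter...
original = (1 << 3) | (1 << 12)
small = halve(original, 16)
# ...is still found in the 8-bit filter at positions 3 and 4.
print([halved_position(p, 16) for p in (3, 12)], bin(small))
```

Halving keeps the "no false negatives" guarantee, but the fill ratio (and hence the false positive rate) goes up.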
16
Scalable BF (Almeida et al., 2007)
We can find the optimal k knowing n / m in advance. As m is settled once, we must know (roughly) n, the number of items to add. What if we have only a vague idea of the size of n..?
If the initial m is too large, we may halve it easily (see prev. slide). Crude, but possible.
What about m being too small?
Solution: when the filter gets ‘full’ (reaches the limit on the fill ratio), a new one is added, with tighter max FPR, and querying translates to testing at most all of those filters…
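A rough sketch of the scalable idea follows. It simplifies the scheme of Almeida et al. in ways that are assumptions of this example: a stage counts as 'full' after a planned number of insertions (rather than by fill ratio), each new stage doubles the capacity, and its target FPR is multiplied by a tightening factor of 0.5. All class and parameter names are hypothetical.

```python
import hashlib
import math


class _Stage:
    """One plain Bloom filter sized for `capacity` keys at target FPR `eps`."""

    def __init__(self, capacity: int, eps: float) -> None:
        self.capacity, self.eps, self.count = capacity, eps, 0
        # Standard sizing: m = -n ln(eps) / (ln 2)^2, k = ln 2 * m / n.
        self.m = math.ceil(-capacity * math.log(eps) / math.log(2) ** 2)
        self.k = max(1, round(math.log(2) * self.m / capacity))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str) -> list[int]:
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def exists(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


class ScalableBloomFilter:
    def __init__(self, initial_capacity: int = 64, eps: float = 0.01,
                 tightening: float = 0.5) -> None:
        self.tightening = tightening
        # Stage i targets eps0 * r^i; with eps0 = eps * (1 - r) the sum of
        # all stage targets stays below eps.
        self.stages = [_Stage(initial_capacity, eps * (1 - tightening))]

    def insert(self, key: str) -> None:
        last = self.stages[-1]
        if last.count >= last.capacity:  # 'full': add a larger, stricter stage
            last = _Stage(2 * last.capacity, last.eps * self.tightening)
            self.stages.append(last)
        last.insert(key)

    def exists(self, key: str) -> bool:
        # A query may have to test every stage.
        return any(stage.exists(key) for stage in self.stages)
```

The price of not knowing n is visible in `exists`: a negative query has to probe all stages, so lookups slow down as the structure grows.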
17
How to approximate a set without knowing its size in advance
ε – max allowed false positive rate
Classic result: BF (and some other related structures) offers an O(n log(1/ε))-bit solution when n is known in advance.
Pagh, Segev, Wieder (2013):
18
Semi-join operation in a distributed database

Database A:
Empl     Salary   Addr   City
John     60K      …      New York
George   30K      …      New York
Moe      25K      …      Topeka
Alice    70K      …      Chicago
Raul     30K      …      Chicago

Database B:
City       Cost of living
New York   60K
Chicago    55K
Topeka     30K

Task: Create a table of all employees that make < 40K and live in a city where COL > 50K.
Result columns: Empl, Salary, Addr, City, COL
Semi-join: send (from A to B) just (City).
Anything better?
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
19
Bloom-join
• BF-based solution: A sends a Bloom filter instead of actual city names,
• …then B sends back its answers…
• …from which A filters out the false positives
This is to minimize transfer (over a network) between the database sites! The CPU work is increased: B needs to filter its city list using the received filter, A needs to filter its received list of persons.
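The exchange can be simulated on the example tables from the semi-join slide. The sketch assumes that site A holds the employee table and site B the cost-of-living table, that salaries and COL are given in thousands, and that the filter uses 64 bits and 3 hash functions; these choices and the helper names are illustrative only.

```python
import hashlib

M, K = 64, 3  # filter size in bits and number of hash functions (arbitrary)


def positions(key: str) -> list[int]:
    d = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def make_filter(keys) -> int:
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits


def maybe_contains(bits: int, key: str) -> bool:
    return all((bits >> p) & 1 for p in positions(key))


# Site A: (Empl, Salary in K, City); the Addr column is omitted here.
employees = [("John", 60, "New York"), ("George", 30, "New York"),
             ("Moe", 25, "Topeka"), ("Alice", 70, "Chicago"),
             ("Raul", 30, "Chicago")]
# Site B: City -> cost of living in K.
cost_of_living = {"New York": 60, "Chicago": 55, "Topeka": 30}

# Step 1 (A -> B): send a Bloom filter of the cities of low-paid employees.
low_paid = [e for e in employees if e[1] < 40]
bloom = make_filter({city for _, _, city in low_paid})

# Step 2 (B -> A): send back expensive cities that pass the filter
# (this list may contain false positives).
answers = {city: col for city, col in cost_of_living.items()
           if col > 50 and maybe_contains(bloom, city)}

# Step 3 (at A): join locally; false positives match no row and drop out.
result = [(name, salary, city, answers[city])
          for name, salary, city in low_paid if city in answers]
print(result)
```

Only M bits travel from A to B instead of the city names themselves, and any false positive in step 2 costs a little extra transfer but never a wrong row in the final table.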
20
P2P keyword search (Reynolds & Vahdat, 2003)
• distributed inverted index on words, multi-word queries,
• Peer A holds list of document IDs containing Word1, Peer B holds list for Word2,
• intersection needed, but minimize communication,
• A sends B a Bloom filter of document list,
• B sends back possible intersections to A,
• A verifies and sends the true result to user,
• i.e. equivalent to Bloom-join
21
Distributed Web caches (Fan et al., 2000)
[Figure: Web Cache 1, Web Cache 2, Web Cache 3; the Web]
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
22
k-mer counting in bioinformatics
http://www.homolog.us/blogs/wp-content/uploads/2011/07/i6.png
k-mers: substrings of length k in a DNA sequence. Counting them is important: for genome de novo assemblers (based on a de Bruijn graph), for detection of repeated sequences, to study the mechanisms of sequence duplication in genomes, etc.
23
BFCounter algorithm (Melsted & Pritchard, 2011)
The considered problem variant: find all non-unique k-mers in a collection of reads, with their counts.
I.e. ignore k-mers with occ = 1 (almost certainly noise).
Input data: 2.66G 36bp Illumina reads (40-fold coverage).
(Output) statistics: 12.18G k-mers (k = 25) present in the sequencing reads, of which 9.35G are unique and 2.83G have coverage of two or greater.
24
BFCounter idea (Melsted & Pritchard, 2011)
Both a Bloom filter and a plain hash table are used.
Bloom filter B is used to store implicitly all k-mers seen so far, while only non-unique k-mers are inserted into the hash table T.
For each k-mer x, we check if x is in B. If not, we update the appropriate bits in B, to indicate that it has now been observed.
If x is in B, then we check if it is in T, and if not, we add it to T (with freq = 2).
What about false positives?
25
BFCounter idea, cont’d (Melsted & Pritchard, 2011)
After the first pass through the sequence data, one can re-iterate over the sequence data to obtain exact k-mer counts in T (and then delete all unique k-mers).
Extra time for this second round: at most 50% of the total time, and it tends to be less, since hash table lookups are generally faster than insertions.
Approximate version possible: no re-iteration (i.e. coverage counts for some k-mers will be higher by 1 than their true value).
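The two-pass scheme can be sketched as below. This is a toy version, not the BFCounter code: the function names, the filter parameters and the hashing scheme are assumptions, and the first pass stores a placeholder count of 0 in T (instead of freq = 2) because the second pass recounts every candidate exactly.

```python
import hashlib


def kmers(read: str, k: int):
    """All substrings of length k of a read, in order."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def count_non_unique_kmers(reads, k, m=1 << 16, num_hashes=4):
    bits = bytearray(m // 8)  # Bloom filter B (m is a multiple of 8)

    def positions(x: str) -> list[int]:
        d = hashlib.sha256(x.encode("ascii")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % m for i in range(num_hashes)]

    table = {}  # hash table T: candidate non-unique k-mers
    # Pass 1: B remembers every k-mer seen; T gets those seen (probably) twice.
    for read in reads:
        for x in kmers(read, k):
            pos = positions(x)
            if all(bits[p >> 3] & (1 << (p & 7)) for p in pos):
                table.setdefault(x, 0)  # seen before, or a false positive of B
            else:
                for p in pos:
                    bits[p >> 3] |= 1 << (p & 7)
    # Pass 2: exact counts for the candidates only.
    for read in reads:
        for x in kmers(read, k):
            if x in table:
                table[x] += 1
    # False positives of B show up with count 1 and are deleted here.
    return {x: c for x, c in table.items() if c > 1}


print(count_non_unique_kmers(["ACGTACGT", "CGTA"], k=4))
```

Because the second pass recounts from scratch, the result is exact whatever the false positive rate of B; a worse rate only inflates T temporarily.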
26
Memory usage for chr21 (Melsted & Pritchard, 2011)
27
BF, cache access
Negative answer: ½ chance that the first probed bit is 0, then we terminate (i.e., 1 cache miss – in rare cases 0).
On avg with a negative answer: (almost) 2 cache misses. Good (and hard to improve).
Positive answer: (almost) k misses on avg.
A problem not really addressed until quite recently…
28
Blocked Bloom filters (Putze et al., 2007, 2009)
The idea:
first h.f. determines the cache line (of typical size 64B = 512 bits nowadays),
the next k–1 h.f. are used to set or test bits (as usual) but only inside this one block.
I.e. (up to) one cache miss always!
Drawback: FPR slightly larger than with plain BF for the same c := m / n and k.
And the loss grows with growing c… (even if smaller k is chosen for large c, which helps somewhat).
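The blocking idea can be illustrated as follows. Python cannot show the cache behaviour itself, so the sketch only demonstrates the addressing: each block is modelled as a 512-bit integer, and the class name and the hashing scheme are assumptions of this example.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line


class BlockedBloomFilter:
    def __init__(self, num_blocks: int, k: int) -> None:
        if k < 2:
            raise ValueError("need one h.f. for the block and at least one for bits")
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks  # each block is a 512-bit integer

    def _block_and_mask(self, key: str) -> tuple[int, int]:
        d = hashlib.sha256(key.encode("utf-8")).digest()
        # First hash function: choose the block (cache line).
        block = int.from_bytes(d[:8], "big") % self.num_blocks
        # Remaining k-1 hash functions: bit positions inside that block only.
        h1 = int.from_bytes(d[8:16], "big")
        h2 = int.from_bytes(d[16:24], "big") | 1
        mask = 0
        for i in range(self.k - 1):
            mask |= 1 << ((h1 + i * h2) % BLOCK_BITS)
        return block, mask

    def insert(self, key: str) -> None:
        block, mask = self._block_and_mask(key)
        self.blocks[block] |= mask

    def exists(self, key: str) -> bool:
        block, mask = self._block_and_mask(key)
        # Only one block is ever touched per operation.
        return (self.blocks[block] & mask) == mask
```

Blocks receive uneven numbers of keys, and the overloaded ones are what push the FPR above that of a plain filter of the same size.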
29
Blocked Bloom filters, cont’d (Putze et al., 2007, 2009)
I.e. if c < 20 (the top row), then the space grows usually by <20%
compared to the plain BF, with comparable FPR.
Unfortunately, for large c (rarely needed?) the loss is very significant.
The idea of blocking for BF was first suggested in (Manber & Wu, 1994), for storing the filter on disk.
30
Counting Bloom filter (Fan et al., 1998, 2000)
BF with delete: use small counters instead of single bits.
BF[pos]++ at insert, BF[pos]-- at del.
E.g. 4 bits: up to count 15.
Problem: counter overflow (plain solution: freeze the given counter).
Another (obvious) problem: more space, e.g. 4 times.
4-bit counters and k < ln 2 · (m / n) ⇒ probability of overflow ≤ 1.37e–15 · m
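A counting Bloom filter with frozen 4-bit counters might look like the sketch below. The counters are kept in a plain Python list for readability (a compact version would pack two per byte); the class name and the hashing scheme are assumptions of this example.

```python
import hashlib


class CountingBloomFilter:
    MAX_COUNT = 15  # largest value of a 4-bit counter

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, key: str) -> list[int]:
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            if self.counters[p] < self.MAX_COUNT:  # a saturated counter is frozen
                self.counters[p] += 1

    def delete(self, key: str) -> None:
        # Only safe for keys that were really inserted.
        for p in self._positions(key):
            if 0 < self.counters[p] < self.MAX_COUNT:  # frozen counters stay put
                self.counters[p] -= 1

    def exists(self, key: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))
```

Freezing a counter at 15 trades a possible permanent false positive for never producing a false negative through overflow.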
31
CBF, another problem…
A deletion instruction for a false positive item (a.k.a. incorrect deletion of a false positive item) may produce false negative items!
Problem widely discussed and analyzed in (Guo et al., 2010)
32
Deletable Bloom filter (DlBF) (Rothenberg et al., 2010)
Cute observation: those of the k bits for an item x which don’t have a collision may be safely unset. If at least one of those k bits is such, then we’ve managed to delete x!
How to distinguish colliding (overlapping) set bits from non-colliding ones?
One extra bit per location? Quite costly…
33
Deletable Bloom filter, cont’d (Rothenberg et al., 2010)
Compromise solution: divide the bit-vector into small areas; an area is marked as collision-free iff no collision has happened in it.
34
DlBF, deletability prob. as a function of filter density
35
Compressed Bloom filter (Mitzenmacher, 2002)
If RAM is not an issue, but we want to transmit the filter over a network…
Mitzenmacher noticed it pays to use more space, incl. more 0 bits (i.e. the structure is more sparse), as then the bit-vector becomes compressible. (In a plain BF the numbers of 0s and 1s are approx. equal, so it is practically incompressible.)
m / n increased from 16 to 48: after compression approx. the same size, but the FPR is roughly halved.
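The 16-versus-48 comparison can be checked with a short calculation. The sketch assumes an ideal compressor that reaches the entropy of the bit-vector, and it picks k = 11 for the plain filter and k = 3 for the sparse one; both values are choices made for this illustration.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy (bits) of a bit that is 0 with probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def compressed_filter_stats(bits_per_key: float, k: int) -> tuple[float, float]:
    """Return (FPR, compressed bits per key) for a filter with m/n = bits_per_key."""
    p_zero = math.exp(-k / bits_per_key)   # Pr(a given bit is still 0)
    fpr = (1 - p_zero) ** k
    return fpr, bits_per_key * entropy(p_zero)


plain = compressed_filter_stats(16, 11)    # near-optimal k, about half the bits set
sparse = compressed_filter_stats(48, 3)    # three times the space, few bits set
print(f"m/n = 16, k = 11: FPR = {plain[0]:.6f}, compressed = {plain[1]:.2f} bits/key")
print(f"m/n = 48, k = 3:  FPR = {sparse[0]:.6f}, compressed = {sparse[1]:.2f} bits/key")
```

Under these assumptions both filters take about 16 bits per key on the wire, while the sparse one has less than half the false positive rate and also needs fewer hash evaluations per query.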
36
Conclusions
Bloom Filter is alive and kicking!
Lots of applications and lots of new variants.
In theory: constant FP rate and constant number of bits per key.
In practice: always think about what FP rate you can allow. Also: what the errors mean (erroneous results or “only” increased processing time for false positives?).
Bottom line: succinct data structure for Big Data.