
Hash Functions FTW


Presentation on Hash Functions, Bloom Filters, and Hash-Oriented Storage


Page 1: Hash Functions FTW

Hash Functions FTW*

Fast Hashing, Bloom Filters & Hash-Oriented Storage

Sunny Gleason

* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions

Page 2: Hash Functions FTW

What’s in this Presentation

• Hash Function Survey

• Hash Performance

• Bloom Filters

• HashFile : Hash Storage

Page 3: Hash Functions FTW

Hash Functions

int getIntHash(byte[] data);   // 32-bit
long getLongHash(byte[] data); // 64-bit

int v1 = hash("foo".getBytes());
int v2 = hash("goo".getBytes());

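int hash(byte[] value) { // a simple hash
  int h = 0;
  for (byte b : value) {
    h = (h << 5) ^ (h >>> 27) ^ b; // unsigned shift (>>>) avoids sign extension
  }
  return Math.floorMod(h, PRIME); // PRIME: a prime constant, e.g. the table size; floorMod keeps the result non-negative
}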

Page 4: Hash Functions FTW

Hash Functions

• Goal : v1 has many bit differences from v2

• Desirable Properties:

• Uniform Distribution - few collisions

• Very Fast Computation
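One way to check the "many bit differences" goal is to count the bits in which two hash values differ. The sketch below is illustrative only: `simpleHash` mirrors the rotate-and-xor hash from the previous slide (without the final modulo), and `bitDifference` is a hypothetical helper name. Because rotate and xor are both linear, the one-bit input difference between "foo" and "goo" survives as exactly one differing output bit, far from the roughly 16 of 32 a well-mixed hash would give.

```java
import java.nio.charset.StandardCharsets;

public class BitDifference {
    // Same rotate-and-xor idea as the simple hash, without the final modulo.
    static int simpleHash(byte[] value) {
        int h = 0;
        for (byte b : value) {
            h = Integer.rotateLeft(h, 5) ^ (b & 0xff);
        }
        return h;
    }

    // Number of bit positions in which two 32-bit hash values differ.
    static int bitDifference(int v1, int v2) {
        return Integer.bitCount(v1 ^ v2);
    }

    public static void main(String[] args) {
        int v1 = simpleHash("foo".getBytes(StandardCharsets.UTF_8));
        int v2 = simpleHash("goo".getBytes(StandardCharsets.UTF_8));
        System.out.println("differing bits: " + bitDifference(v1, v2) + " of 32");
    }
}
```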

Page 5: Hash Functions FTW

Hash Applications

Goal: O(1) access

• Hash Table

• Hash Set

• Bloom Filter

Page 6: Hash Functions FTW

Popular Hash Functions

• FNV Hash

• DJB Hash

• Jenkins Hash

• Murmur2

• New (Promising?): CrapWow

• Awesome & Slow: SHA-1, MD5 etc.
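As a concrete instance of one entry in the list above, here is a minimal sketch of the 32-bit FNV-1a variant of the FNV hash. The class and method names are illustrative; the offset basis and prime are the standard published 32-bit FNV constants. The slides do not say which FNV variant was benchmarked, so treat the choice of FNV-1a as an assumption.

```java
import java.nio.charset.StandardCharsets;

public class Fnv1a {
    private static final int FNV_OFFSET_BASIS_32 = 0x811c9dc5;
    private static final int FNV_PRIME_32 = 0x01000193;

    // FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
    static int fnv1a32(byte[] data) {
        int h = FNV_OFFSET_BASIS_32;
        for (byte b : data) {
            h ^= (b & 0xff);     // mask so negative bytes do not sign-extend
            h *= FNV_PRIME_32;   // int overflow wraps, i.e. arithmetic mod 2^32
        }
        return h;
    }

    public static void main(String[] args) {
        int h = fnv1a32("foo".getBytes(StandardCharsets.UTF_8));
        System.out.println(Integer.toHexString(h));
    }
}
```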

Page 8: Hash Functions FTW

A Strawman “Set”

• N keys, K bytes per key

• Allocate array of size K * N bytes

• Utilize array storage as:

• a heap or tree: O(lg N) insert/delete/lookup

• a hash: O(1) insert/delete/lookup

• What if we don’t have room for K*N bytes?

Page 9: Hash Functions FTW

Bloom Filter

• Key Point: give up on storing all the keys

• Store r bits per key instead of K bytes

• Allocate bit vector of size: M = r * N, where N is expected number of entries

• Use multiple hash functions of key to determine which bits to set

• Premise: if hash functions are well-distributed, few collisions, high accuracy
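The bullets above can be sketched as a small in-memory Bloom filter. This is a minimal sketch, not the presenter's implementation: the class name is hypothetical, `java.util.BitSet` limits M to the int range, and the k bit positions are derived from one 64-bit FNV-1a hash by double hashing (index i uses h1 + i * h2), a substitute for the "multiple hash functions" on the slide.

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;   // M = r * N
    private final int numHashes; // k

    SimpleBloomFilter(int expectedEntries, int bitsPerKey, int numHashes) {
        this.numBits = Math.multiplyExact(expectedEntries, bitsPerKey);
        this.bits = new BitSet(numBits);
        this.numHashes = numHashes;
    }

    // 64-bit FNV-1a, used here only as a stand-in hash function.
    private static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Double hashing: split the 64-bit hash into two 32-bit halves.
    private int bitIndex(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(byte[] key) {
        long h = fnv1a64(key);
        for (int i = 0; i < numHashes; i++) {
            bits.set(bitIndex(h, i));
        }
    }

    // false means definitely absent; true means present or a false positive.
    boolean mightContain(byte[] key) {
        long h = fnv1a64(key);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(bitIndex(h, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A filter never reports a key it has seen as absent; the only error it can make is a false positive, whose rate depends on the bits-per-key budget.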

Page 10: Hash Functions FTW

Bloom Filter

Page 11: Hash Functions FTW

Tuning Bloom Filters

Let r = M bits / N keys (r: num bits/key)

Let k = 0.7 * r (k: num hashes to use)

Let p = 0.6185 ** r (p: probability of false positives)

Working backwards, we can use desired false positive rate p to tune the data structure space consumption:

r = 8,  p = 2.1e-2
r = 16, p = 4.5e-4
r = 24, p = 9.8e-6
r = 32, p = 2.1e-7
r = 40, p = 4.5e-9
r = 48, p = 9.6e-11
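The tuning rules above are simple enough to compute directly. The sketch below uses the slide's own approximations (k = 0.7 * r and p = 0.6185 ** r); the class and method names are hypothetical. `bitsPerKeyFor` works backwards from a target false positive rate by inverting the formula for p.

```java
public class BloomTuning {
    // k = 0.7 * r, rounded to a whole number of hash functions (at least one).
    static int numHashes(double bitsPerKey) {
        return Math.max(1, (int) Math.round(0.7 * bitsPerKey));
    }

    // p = 0.6185 ** r
    static double falsePositiveRate(double bitsPerKey) {
        return Math.pow(0.6185, bitsPerKey);
    }

    // Inverse of the formula above: r = ln(p) / ln(0.6185).
    static double bitsPerKeyFor(double targetFalsePositiveRate) {
        return Math.log(targetFalsePositiveRate) / Math.log(0.6185);
    }

    // Size of the bit vector in bytes: M / 8 = r * N / 8.
    static long bytesNeeded(long numKeys, int bitsPerKey) {
        return numKeys * bitsPerKey / 8;
    }

    public static void main(String[] args) {
        for (int r = 8; r <= 48; r += 8) {
            System.out.printf("r = %d, k = %d, p = %.1e%n",
                    r, numHashes(r), falsePositiveRate(r));
        }
    }
}
```

For example, `bytesNeeded(1_000_000_000L, 8)` gives 1,000,000,000 bytes: about 1 GB of bit vector for one billion keys at 8 bits per key.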

Page 12: Hash Functions FTW

Bloom Filter Performance

100MM entries, 8 bits/key : 833k ops/s
100MM entries, 32 bits/key : 256k ops/s
1BN entries, 8 bits/key : 714k ops/s
1BN entries, 32 bits/key : 185k ops/s

Hypothesis : difference between 100MM and 1BN is due to locality of memory access in smaller bit vector

Page 13: Hash Functions FTW

Hash-Oriented Storage

• HashFile : 64-bit clone of djb’s constant db “CDB”

• Plain ol’ Key/Value storage: add(byte[] k, byte[] v), byte[] lookup(byte[] k)

• Constant aka “Immutable” Data Store

create(), add(k, v) ... , build() ... before lookup(k)

• Use properties of hash table to achieve O(1) disk seeks per lookup

Page 14: Hash Functions FTW

HashFile Structure

• Header (fixed width): table pointers, contains offsets of hash tables and count of elements per table

• Body (variable width): contains concatenation of all keys and values (with data lengths)

• Footer (fixed width): hash “tables” containing long hash values of keys alongside long offsets into body

Page 15: Hash Functions FTW

HashFile Diagram

• Create: initialize empty header, start appending keys/values while recording offsets and hash values of keys

• Build: take list of hash values and offsets and turn them into hash tables, backfill header with values

• Lookup: compute hash(key), compute offset into table (hash modulo size of table), use table to find offset into body, return the value from body

HEADER: p1 s3 | p2 s4 | p3 s2 | p4 s1

BODY: k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6 | k7 v7

FOOTER: hk7 o7 | hk3 o3 | hk4 o4 | hk1 o1
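The create / build / lookup flow can be sketched in memory as follows. This is a simplified illustration under stated assumptions, not the HashFile on-disk format: it keeps everything in byte arrays instead of a file, uses a single hash table rather than the several tables the header points to, resolves collisions by linear probing, and uses 64-bit FNV-1a as a stand-in hash. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MiniHashFile {
    private final ByteArrayOutputStream bodyOut = new ByteArrayOutputStream();
    private final List<long[]> entries = new ArrayList<>(); // {hash, body offset}
    private ByteBuffer body; // set by build()
    private long[] table;    // pairs (hash, offset + 1); offset + 1 == 0 marks an empty slot

    static long hash64(byte[] data) { // 64-bit FNV-1a stand-in
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Create phase: append (keyLen, valueLen, key, value), recording hash and offset.
    void add(byte[] key, byte[] value) {
        entries.add(new long[] {hash64(key), bodyOut.size()});
        byte[] lengths = ByteBuffer.allocate(8).putInt(key.length).putInt(value.length).array();
        bodyOut.write(lengths, 0, lengths.length);
        bodyOut.write(key, 0, key.length);
        bodyOut.write(value, 0, value.length);
    }

    // Build phase: turn the recorded (hash, offset) list into a hash table.
    void build() {
        body = ByteBuffer.wrap(bodyOut.toByteArray());
        int slots = Math.max(1, entries.size() * 2); // half-full, so probing always terminates
        table = new long[slots * 2];
        for (long[] e : entries) {
            int slot = (int) Long.remainderUnsigned(e[0], slots);
            while (table[2 * slot + 1] != 0) {
                slot = (slot + 1) % slots;
            }
            table[2 * slot] = e[0];
            table[2 * slot + 1] = e[1] + 1;
        }
    }

    // Lookup: hash modulo table size, probe the table, then read the record from the body.
    byte[] lookup(byte[] key) {
        int slots = table.length / 2;
        long h = hash64(key);
        int slot = (int) Long.remainderUnsigned(h, slots);
        while (table[2 * slot + 1] != 0) {
            if (table[2 * slot] == h) {
                ByteBuffer view = body.duplicate();
                view.position((int) (table[2 * slot + 1] - 1));
                byte[] k = new byte[view.getInt()];
                byte[] v = new byte[view.getInt()];
                view.get(k);
                view.get(v);
                if (Arrays.equals(k, key)) {
                    return v; // stored key matches, not just the hash
                }
            }
            slot = (slot + 1) % slots;
        }
        return null; // reached an empty slot: key is absent
    }
}
```

Storing the full hash next to each offset lets a lookup skip most non-matching slots without touching the body, which is what keeps the number of body reads per lookup small.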

Page 16: Hash Functions FTW

HashFile Performance

• Spec: ≤ 2 disk seeks per lookup

• Number of seeks independent of number of entries

• X25E SSD: 1BN 8-byte keys, values (41GB): 650μs lookup w/ cold cache, up to 700x faster as filesystem cache warms, 0.9μs when in-memory

• With 100MM entries (4GB), cold cache is ~600μs (from locality), 0.6μs warm

Page 17: Hash Functions FTW

Conclusions

• Be aware of different Hash Functions and their collision / performance tradeoffs

• Bloom Filters are extremely useful for fast, large-scale set membership

• HashFile provides excellent performance in cases where a static K/V store suffices

Page 18: Hash Functions FTW

Future Work

• Implement cWow hash in Java

• Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce 24 bytes-per-KV overhead)

• Implement a read-write (non-constant) version of HashFile

• Bloom Filter that spills to SSD

Page 19: Hash Functions FTW

Thank You!

...Any questions? :)