Hash Functions FTW*

Fast Hashing, Bloom Filters & Hash-Oriented Storage

Sunny Gleason

* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions

What’s in this Presentation

• Hash Function Survey

• Hash Performance

• Bloom Filters

• HashFile : Hash Storage

Hash Functions

int getIntHash(byte[] data);   // 32-bit
long getLongHash(byte[] data); // 64-bit

int v1 = hash("foo".getBytes());
int v2 = hash("goo".getBytes());

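static final int PRIME = 1000003; // assumed value; the slide leaves PRIME undefined

int hash(byte[] value) { // a simple rotate-and-XOR hash
  int h = 0;
  for (byte b : value) {
    h = (h << 5) ^ (h >>> 27) ^ b; // >>> (unsigned shift) makes this a true 5-bit rotation
  }
  return Math.floorMod(h, PRIME); // floorMod stays non-negative even when h is negative
}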

Hash Functions

• Goal : v1 has many bit differences from v2

• Desirable Properties:

• Uniform Distribution - few collisions

• Very Fast Computation
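The "many bit differences" goal can be checked directly by XOR-ing two hash values and counting the set bits. The sketch below is only an illustration: the class and method names are made up for this example, and it uses the simple rotate-and-XOR hash from the earlier slide without the final modulo.

```java
import java.nio.charset.StandardCharsets;

class BitDifference {
    // The simple rotate-and-XOR hash from the earlier slide, without the modulo step.
    static int simpleHash(byte[] value) {
        int h = 0;
        for (byte b : value) {
            h = (h << 5) ^ (h >>> 27) ^ b;
        }
        return h;
    }

    // Number of bit positions in which two 32-bit hash values differ.
    static int bitDifference(int v1, int v2) {
        return Integer.bitCount(v1 ^ v2);
    }

    public static void main(String[] args) {
        int v1 = simpleHash("foo".getBytes(StandardCharsets.US_ASCII));
        int v2 = simpleHash("goo".getBytes(StandardCharsets.US_ASCII));
        System.out.println(bitDifference(v1, v2)); // prints 1
    }
}
```

For "foo" and "goo" the simple hash differs in a single bit, because the one-bit input difference is only shifted, never mixed. A hash with good avalanche behavior would flip roughly half of the 32 output bits on average.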

Hash Applications

Goal: O(1) access

• Hash Table

• Hash Set

• Bloom Filter

Popular Hash Functions

• FNV Hash

• DJB Hash

• Jenkins Hash

• Murmur2

• New (Promising?): CrapWow

• Awesome & Slow: SHA-1, MD5 etc.
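As a concrete example of one entry in this list, here is a minimal sketch of 32-bit FNV in Java. It assumes the FNV-1a variant (XOR, then multiply) with the standard published offset basis and prime; the slides do not say which FNV variant was benchmarked, and the class name is made up for this example.

```java
class Fnv1a {
    static final int OFFSET_BASIS = 0x811C9DC5; // standard 32-bit FNV offset basis
    static final int FNV_PRIME = 0x01000193;    // standard 32-bit FNV prime

    // FNV-1a: for each byte, XOR it into the state, then multiply by the prime.
    static int hash32(byte[] data) {
        int h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xff); // mask so negative bytes are not sign-extended
            h *= FNV_PRIME;  // int multiplication wraps modulo 2^32
        }
        return h;
    }
}
```

The whole inner loop is one XOR and one multiply per byte, which is why this family sits on the fast side of the list.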

Evaluating Hash Functions

• Hash Function “Zoo”

• Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1

• Performance (MM ops/s): [bar chart; values not recoverable from the extracted text]

http://www.team5150.com/~andrew/noncryptohashzoo/


http://www.team5150.com/~andrew/noncryptohashzoo/CRC32-avalanche.html

http://www.team5150.com/~andrew/noncryptohashzoo/DJB-avalanche.html

http://www.team5150.com/~andrew/noncryptohashzoo/lookup3-avalanche.html

http://www.team5150.com/~andrew/noncryptohashzoo/FNV-avalanche.html

http://www.team5150.com/~andrew/noncryptohashzoo/Murmur2-avalanche.html

http://www.team5150.com/~andrew/noncryptohashzoo/SHA1-avalanche.html

A Strawman “Set”

• N keys, K bytes per key

• Allocate array of size K * N bytes

• Utilize array storage as:

• a heap or tree: O(lg N) insert/lookup/remove

• a hash: O(1) expected insert/lookup/remove

• What if we don’t have room for K*N bytes?

Bloom Filter

• Key Point: give up on storing all the keys

• Store r bits per key instead of K bytes

• Allocate bit vector of size: M = r * N, where N is expected number of entries

• Use multiple hash functions of key to determine which bits to set

• Premise: if hash functions are well-distributed, few collisions, high accuracy
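The bullets above can be sketched as a small in-memory Bloom filter. This is a minimal illustration, not the g414-hash implementation: the class name is invented, the k bit positions are derived from two base hashes by double hashing (a common substitute for k independent hash functions), and the base hashes (FNV-1a and java.util.Arrays.hashCode) are assumptions chosen for brevity.

```java
import java.util.Arrays;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;   // M = r * N
    private final int numHashes; // k, roughly 0.7 * r

    SimpleBloomFilter(int expectedEntries, int bitsPerKey) {
        this.numBits = Math.multiplyExact(expectedEntries, bitsPerKey);
        this.numHashes = Math.max(1, (int) Math.round(0.7 * bitsPerKey));
        this.bits = new BitSet(numBits);
    }

    // First base hash: 32-bit FNV-1a (an assumption of this sketch).
    private static int fnv1a(byte[] data) {
        int h = 0x811C9DC5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th bit position via double hashing: h1 + i * h2, reduced modulo M.
    private int index(byte[] key, int i) {
        int h1 = fnv1a(key);
        int h2 = Arrays.hashCode(key) | 1; // second base hash, forced non-zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means definitely absent; true means present or a false positive.
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A java.util.BitSet is limited to about 2^31 bits, so a filter with 1BN entries at 8 or more bits per key would need a different backing store, such as a long[] indexed by a 64-bit position.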

Bloom Filter

Tuning Bloom Filters

Let r = M bits / N keys (r: num bits/key)

Let k = 0.7 * r (k: num hashes to use)

Let p = 0.6185 ** r (p: probability of false positives)

Working backwards, we can use desired false positive rate p to tune the data structure space consumption:

r = 8, p = 2.1e-2

r = 16, p = 4.5e-4

r = 24, p = 9.8e-6

r = 32, p = 2.1e-7

r = 40, p = 4.5e-9

r = 48, p = 9.6e-11
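The tuning rules can be written as a few helper functions, including the "working backwards" step from a target false positive rate to bits per key. This is a sketch under the slide's own approximations (k = 0.7 * r, p = 0.6185 ** r); the class and method names are hypothetical.

```java
class BloomTuning {
    // p = 0.6185 ** r, the slide's false positive estimate for r bits per key.
    static double falsePositiveRate(int bitsPerKey) {
        return Math.pow(0.6185, bitsPerKey);
    }

    // k = 0.7 * r, rounded to a whole number of hash functions (at least 1).
    static int numHashes(int bitsPerKey) {
        return Math.max(1, (int) Math.round(0.7 * bitsPerKey));
    }

    // Working backwards: the smallest r such that 0.6185 ** r <= targetP.
    static int bitsPerKeyFor(double targetP) {
        return (int) Math.ceil(Math.log(targetP) / Math.log(0.6185));
    }

    // Size of the bit vector in bytes for N keys at r bits per key.
    static long totalBytes(long numKeys, int bitsPerKey) {
        return (numKeys * bitsPerKey + 7) / 8;
    }
}
```

For example, a 1% target gives bitsPerKeyFor(0.01) = 10 bits per key, and totalBytes(1_000_000_000L, 8) is 1,000,000,000 bytes, regardless of how many bytes each key occupies.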

Bloom Filter Performance

100MM entries, 8 bits/key : 833k ops/s

100MM entries, 32 bits/key : 256k ops/s

1BN entries, 8 bits/key : 714k ops/s

1BN entries, 32 bits/key : 185k ops/s

Hypothesis : difference between 100MM and 1BN is due to locality of memory access in smaller bit vector

Hash-Oriented Storage

• HashFile : 64-bit clone of djb’s constant db “CDB”

• Plain ol’ Key/Value storage: add(byte[] k, byte[] v), byte[] lookup(byte[] k)

• Constant aka “Immutable” Data Store

create(), add(k, v) ... , build() ... before lookup(k)

• Use properties of hash table to achieve O(1) disk seeks per lookup

http://github.com/sunnygleason/g414-hash/

HashFile Structure

• Header (fixed width): table pointers; contains offsets of hash tables and count of elements per table

• Body (variable width): contains concatenation of all keys and values (with data lengths)

• Footer (fixed width): hash “tables” containing long hash values of keys alongside long offsets into body

HashFile Diagram

• Create: initialize empty header, start appending keys/values while recording offsets and hash values of keys

• Build: take list of hash values and offsets and turn them into hash tables, backfill header with values

• Lookup: compute hash(key), compute offset into table (hash modulo size of table), use table to find offset into body, return the value from body

HEADER: p1 s3 | p2 s4 | p3 s2 | p4 s1

BODY: k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6 | k7 v7

FOOTER: hk7 o7 | hk3 o3 | hk4 o4 | hk1 o1
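The create / build / lookup cycle can be sketched in memory as follows. This is a deliberately simplified illustration, not the g414-hash code: it keeps everything in byte arrays instead of a file, uses one hash table instead of the several tables referenced by the header, and assumes 64-bit FNV-1a as the key hash. The class name MiniHashFile and its record layout (two 4-byte lengths, then key, then value) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MiniHashFile {
    private final ByteArrayOutputStream bodyOut = new ByteArrayOutputStream();
    private final List<long[]> entries = new ArrayList<>(); // {hash, body offset}
    private byte[] body;
    private long[] table; // slot s: table[2s] = hash, table[2s+1] = offset + 1 (0 = empty)

    // 64-bit FNV-1a (an assumption of this sketch).
    static long hash(byte[] key) {
        long h = 0xCBF29CE484222325L;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 0x100000001B3L;
        }
        return h;
    }

    // Create phase: append the record to the body, remember its hash and offset.
    void add(byte[] k, byte[] v) {
        if (table != null) throw new IllegalStateException("already built");
        entries.add(new long[] { hash(k), bodyOut.size() });
        byte[] lens = ByteBuffer.allocate(8).putInt(k.length).putInt(v.length).array();
        bodyOut.write(lens, 0, lens.length);
        bodyOut.write(k, 0, k.length);
        bodyOut.write(v, 0, v.length);
    }

    // Build phase: turn the recorded (hash, offset) pairs into an open-addressed table.
    void build() {
        body = bodyOut.toByteArray();
        int slots = Math.max(1, entries.size() * 2); // load factor <= 0.5
        table = new long[slots * 2];
        for (long[] e : entries) {
            int s = (int) Long.remainderUnsigned(e[0], slots);
            while (table[2 * s + 1] != 0) s = (s + 1) % slots; // linear probing
            table[2 * s] = e[0];
            table[2 * s + 1] = e[1] + 1;
        }
    }

    // Lookup: hash modulo table size, probe the table, then read the record from the body.
    byte[] lookup(byte[] k) {
        if (table == null) throw new IllegalStateException("call build() first");
        int slots = table.length / 2;
        long h = hash(k);
        int s = (int) Long.remainderUnsigned(h, slots);
        while (table[2 * s + 1] != 0) {
            if (table[2 * s] == h) {
                ByteBuffer buf = ByteBuffer.wrap(body);
                buf.position((int) (table[2 * s + 1] - 1));
                int keyLen = buf.getInt();
                int valueLen = buf.getInt();
                byte[] key = new byte[keyLen];
                buf.get(key);
                if (Arrays.equals(key, k)) {
                    byte[] value = new byte[valueLen];
                    buf.get(value);
                    return value;
                }
            }
            s = (s + 1) % slots;
        }
        return null; // reached an empty slot: key is not present
    }
}
```

In the on-disk format, the table probe and the body read correspond to the seeks performed per lookup, which is why the seek count does not grow with the number of entries.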

HashFile Performance

• Spec: ≤ 2 disk seeks per lookup

• Number of seeks independent of number of entries

• X25E SSD: 1BN 8-byte keys, values (41GB): 650μs lookup w/ cold cache, up to 700x faster as filesystem cache warms, 0.9μs when in-memory

• With 100MM entries (4GB), cold cache is ~600μs (from locality), 0.6μs warm

Conclusions

• Be aware of different Hash Functions and their collision / performance tradeoffs

• Bloom Filters are extremely useful for fast, large-scale set membership

• HashFile provides excellent performance in cases where a static K/V store suffices

Future Work

• Implement cWow hash in Java

• Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce 24 bytes-per-KV overhead)

• Implement a read-write (non-constant) version of HashFile

• Bloom Filter that spills to SSD

Thank You!...Any questions? :)

References

• GitHub Project: g414-hash (hash function, bloom filter, HashFile implementations)

• Wikipedia: Hash Function, Bloom Filter

• Non-Cryptographic Hash Function Zoo

• DJB CDB, sg-cdb (java implementation)

http://github.com/sunnygleason/g414-hash

http://en.wikipedia.org/wiki/Hash_function

http://en.wikipedia.org/wiki/Bloom_filter

http://www.team5150.com/~andrew/noncryptohashzoo/

http://cr.yp.to/cdb.html

http://www.strangegizmo.com/products/sg-cdb/
