DASH Hash Functions for storage and data management

DASH

Hash Functions

for storage and data management

Great tools may be inappropriate for some tasks

Popular hash functions are often used in situations they were not designed for

This mistake is especially common in storage and data management

Synopsis

“Enterprise storage needs will increase by a factor of seven over the next three years.” (Strategic Research Corporation, 2002)

“An enterprise spends an average of three dollars managing storage for every one dollar spent on storage hardware.” (Gartner, quoted in “Emerging Technology: Keeping Storage Costs Under Control”, Network Magazine, Oct. 5, 2002)

According to a study at the University of California, Berkeley, more new information is predicted to be stored in the next several years than in all of previously recorded human history combined.

“(…) technology improvements in current magnetic and optical data storage systems are saturating, (…) reaching their theoretically achievable storage densities.” (Liz Murphy, Vice President of Marketing, InPhase Technologies)

Storage Management: An increasing need

Given the extreme growth of storage needs, storage management has become a necessity…

… and thus the bread and butter of many emerging IT companies.

Storage management
• Reduce storage space
• Data mining/warehousing
• Reliable and efficient backup of files
• Duplicate detection
• Mirroring/synchronization
• Compressed representation
• Indexing

Hash Functions

Storage Management and Hash Functions

[Figure: many long bit-streams mapped to a few short hash values]

A hash function associates bit-streams to a small number of “short” hash values

What is a Hash Function?

A hash value is essentially a digital fingerprint

What is a Hash Function?

A hash function is an algorithm to automatically create the fingerprints

Hash functions are used in many applications, such as

• Duplicate detection
• Mirroring/Synchronization
• Indexing
• Error-detecting and error-correcting
• Privacy/Security
• Many more applications…

But existing hash functions are often used for the wrong purpose, or the scientific community has not yet produced any hash function designed for the particular purpose at hand

The many uses of Hash Functions

• Most hash functions used in storage management applications are off-the-shelf hash functions that were designed for dissimilar, sometimes contrary purposes

• The adequacy of these hashes is erroneously assessed, since the assessment is usually based on the probability that two random equiprobable bit-streams hash to the same value. Yet:
– Bit-streams generated by computer applications are not random, but have a definite statistical or deterministic structure
– Collision probabilities do not scale linearly with the number of files hashed: while they remain small, they grow roughly with the square of that number! (Ask your mathematician about the “birthday paradox”)

• Inadequate hash functions may lead to slower processes, greater memory requirements, and sometimes complete disasters (loss of data, data corruption, etc.)

• Given the exponential growth of storage needs, current “good enough” hashes will become inadequate, if not disastrous, in the near future.
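
How quickly the collision probability grows with the number of files can be checked numerically. The sketch below uses the standard birthday approximation p ≈ 1 − exp(−n(n−1)/(2N)) for n hash values drawn uniformly from N possible values. The function name and the example sizes are illustrative assumptions, and the uniform model is the best-case assumption, so real application data may do worse.

```python
import math

def collision_probability(num_items: int, num_hash_values: int) -> float:
    """Approximate probability that at least two of num_items uniformly
    random hash values coincide, out of num_hash_values possible values."""
    exponent = num_items * (num_items - 1) / (2.0 * num_hash_values)
    # -expm1(-x) computes 1 - exp(-x) accurately even when x is tiny.
    return -math.expm1(-exponent)

if __name__ == "__main__":
    # Hypothetical file counts hashed with a 64-bit fingerprint.
    for num_files in (10**6, 10**8, 10**10):
        print(num_files, collision_probability(num_files, 2**64))
```

While the probability is small, doubling the number of files roughly quadruples it, which is why a fingerprint size that looks comfortable today can stop being so as storage grows.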

The need for better hash functions

[Figure: bit-streams mapped to short hash values (00, 10, 11). Labels: few bits to represent hash values; hash values fast to compute]

We want our fingerprints to be small and fast to compute…

Desired properties of all Hash Functions

…But further desired properties diverge depending on what we are using the fingerprints for.

Applied to: all possible bit-streams
Goal: “catch” transmission errors
(e.g. checksum, CRC, Reed-Solomon, etc.)

Error-Detecting and Correcting Hash Functions

1. Fingerprint of data is computed
2. Data and fingerprint are sent together
3. Fingerprint of received data is computed
4. If received and computed fingerprints don’t match, we know an error occurred

[Figure: a bit-stream is sent with fingerprint 10; the received bit-stream is corrupted, and its computed fingerprint, 11, does not match]

These hash functions are designed to “catch” transmission errors.
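
The four steps can be sketched with CRC-32 from Python's standard `zlib` module. The helper names (`send`, `received_ok`, `flip_bit`) and the payload are hypothetical, introduced only for this illustration.

```python
import zlib

def send(data: bytes):
    """Steps 1 and 2: compute the fingerprint and send it with the data."""
    return data, zlib.crc32(data)

def received_ok(data: bytes, fingerprint: int) -> bool:
    """Steps 3 and 4: recompute the fingerprint and compare it with the one received."""
    return zlib.crc32(data) == fingerprint

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a transmission error by flipping a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

payload, fingerprint = send(b"backup block 42")
print(received_ok(payload, fingerprint))               # intact transmission
print(received_ok(flip_bit(payload, 5), fingerprint))  # one bit flipped in transit
```

A CRC is built to catch this kind of accidental damage. Nothing in its design targets telling apart two different but valid application files.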

Error-Detecting and Correcting Hash Functions

(e.g. checksum, CRC, Reed-Solomon, etc.)

Applied to: all possible bit-streams
Goal: privacy/security
(e.g. MD5, SHA, GOST, RIPEMD, etc.)

Cryptographic Hash Functions

[Figure: many long bit-streams mapped to a few short hash values]

These hash functions are intended for security and privacy purposes. They are designed so that, given a fingerprint, it is infeasible to create a bit-stream having this fingerprint.

Cryptographic Hash Functions (e.g. MD5, SHA, GOST, RIPEMD, etc.)

Applied to: application-generated bit-streams
Goal: differentiate bit-streams (as opposed to catching errors or providing privacy/security)

What storage management applications need: Differentiate bit-streams generated by computer applications

What storage management uses: Off-the-shelf hash functions

Consequence: Less efficiency, less effectiveness, more memory

What storage management needs from Hash Functions

[Figure: many long bit-streams mapped to a few short hash values]

differentiation effectiveness ↔ collision probability

Goal in storage management settings: Differentiate bit-streams

What storage management needs from Hash Functions

Applied to: application-generated bit-streams
Goal: differentiate bit-streams (as opposed to catching errors or providing privacy/security)

DASH: Differentiating Application Specific Hash

Hash        Applied to                                       Goal
CRC, etc.   Any type of bit-streams                          Reliable transmission
MD5, etc.   Any type of bit-streams                          Secure encryption
DASH        Bit-streams generated by computer applications   Effective differentiation

[Figure: files → hash → hash values → hash groups → duplicate groups]

Hashes allow duplicate detection processes to group files into “probable duplicates” groups, so that further byte-by-byte comparison is carried out on significantly smaller collections of files.
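
A minimal sketch of this two-stage process follows, assuming the files fit in memory as a hypothetical name-to-bytes mapping. SHA-256 is used only because it ships with Python; it stands in for whatever hash a real tool would use.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict) -> list:
    """Group file names whose contents are identical.

    Stage 1 groups names by hash value ("probable duplicates").
    Stage 2 confirms each group with a full byte-by-byte comparison.
    """
    hash_groups = defaultdict(list)
    for name, content in files.items():
        hash_groups[hashlib.sha256(content).digest()].append(name)

    duplicate_groups = []
    for names in hash_groups.values():
        remaining = list(names)
        while len(remaining) > 1:
            first = remaining.pop(0)
            same = [n for n in remaining if files[n] == files[first]]
            remaining = [n for n in remaining if files[n] != files[first]]
            if same:
                duplicate_groups.append(sorted([first] + same))
    return duplicate_groups

files = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}
print(find_duplicates(files))
```

The fewer collisions the hash produces, the smaller the groups that reach the expensive second stage.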

Hash Functions in Duplicate Detection

[Figure: files → hash values → hash groups → duplicate groups]

Low collision probability = efficient duplicate detection

Hash Functions in Duplicate Detection

[Figure: the master site and the mirror site each compute a hash value; a hash value is transmitted, the two hash values are compared, and the file is transmitted if they differ]

If the hash values are different, the files are different, so the master file is sent to the mirror for backup.

Hash Functions in Mirroring/Synchronization


If hash values are equal, files are assumed to be equal, so master file is NOT sent for backup.

In this case collision probability must be extremely low, so that the likelihood of not backing up a file that should be backed up is almost nil.

Longer hash values → lower collision probabilities

But… longer hash values → more network load
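
The comparison protocol can be sketched as follows. This is a simplified single-process model: both sites are plain dictionaries (hypothetical name-to-bytes mappings), and the "transmission" of a hash value is just a function call. SHA-256 is an assumption standing in for the hash actually deployed.

```python
import hashlib

def fingerprint(content: bytes) -> bytes:
    """Hash value that a site would compute and transmit instead of the file."""
    return hashlib.sha256(content).digest()

def synchronize(master: dict, mirror: dict) -> list:
    """Send a master file to the mirror only when the hash values differ.

    Returns the names of the files that were transmitted.
    """
    sent = []
    for name, content in master.items():
        mirror_content = mirror.get(name)
        mirror_hash = None if mirror_content is None else fingerprint(mirror_content)
        if mirror_hash != fingerprint(content):
            mirror[name] = content  # transmit the file
            sent.append(name)
        # Equal hash values: the files are ASSUMED equal and nothing is sent.
    return sent

master = {"report.doc": b"version 2", "notes.txt": b"unchanged"}
mirror = {"report.doc": b"version 1", "notes.txt": b"unchanged"}
print(synchronize(master, mirror))
```

The branch that skips transmission is where a collision does damage: a changed file whose hash value happens to match is silently left out of the backup.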

Hash Functions in Mirroring/Synchronization

User's point of view:
101110100010100101000011101110100010100101001011
101110100010100101001011101000100101010001000011
101000100101010001000011101110100010100101000011

Stored as:
101110100010100101000011101110100010100101001011
101110100010100101001011101000100101010001000011
101000100101010001000011101110100010100101000011

Hash Functions in Compressed Representations

User's point of view (standard storage):
101110100010100101000011101110100010100101001011
101110100010100101001011101000100101010001000011
101000100101010001000011101110100010100101000011

Stored as (factored storage):
1011101000101001
01000011
101110100010100101001011
1010001001010100
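
One simple way to factor storage is to cut each bit-stream into fixed-size chunks and keep a single copy of each distinct chunk, indexed by its hash. The sketch below does this for the three 48-bit streams shown above, with a hypothetical `ChunkStore` class and an assumed chunk size of 24 bits. The slide factors the streams into smaller pieces than this; fixed-size chunking is a substitute chosen for brevity.

```python
import hashlib

STREAMS = [
    "101110100010100101000011101110100010100101001011",
    "101110100010100101001011101000100101010001000011",
    "101000100101010001000011101110100010100101000011",
]

class ChunkStore:
    """Stores each distinct chunk once; a stream becomes a list of chunk keys."""

    def __init__(self, chunk_size: int = 24):
        self.chunk_size = chunk_size
        self.chunks = {}  # hash value -> chunk

    def put(self, stream: str) -> list:
        recipe = []
        for start in range(0, len(stream), self.chunk_size):
            chunk = stream[start:start + self.chunk_size]
            key = hashlib.sha256(chunk.encode("ascii")).hexdigest()
            stored = self.chunks.setdefault(key, chunk)
            if stored != chunk:
                # Without this check, a collision would silently corrupt data.
                raise ValueError("hash collision between different chunks")
            recipe.append(key)
        return recipe

    def get(self, recipe: list) -> str:
        return "".join(self.chunks[key] for key in recipe)

store = ChunkStore()
recipes = [store.put(stream) for stream in STREAMS]
print(len(store.chunks), "distinct chunks stored for", len(STREAMS), "streams")
```

The three streams are built from the same three 24-bit halves in different orders, so only three chunks are stored instead of six.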

Hash Functions in Compressed Representations

Hash Functions in Indexing

Error-correcting and detecting hash functions: similar bit-streams ↔ dissimilar fingerprints

Cryptographic hash functions: scramble the relation between bit-streams and fingerprints

Indexing hash functions: similar bit-streams ↔ similar fingerprints

When indexing bit-streams using fingerprints, with the intent of carrying out information retrieval, we want similar bit-streams to produce similar fingerprints. This is precisely what most customary hash functions avoid.
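
The contrast can be illustrated with a deliberately crude similarity-preserving fingerprint: keep the bytes found at a few fixed positions. The positions, the function names and the sample strings are assumptions made for this sketch, not a proposed indexing hash. SHA-256 represents the customary behaviour.

```python
import hashlib

SAMPLE_POSITIONS = (0, 3, 5, 8)  # hypothetical choice, for illustration only

def sampled_fingerprint(data: bytes) -> bytes:
    """Keep only the bytes at a few fixed positions."""
    return bytes(data[p] for p in SAMPLE_POSITIONS if p < len(data))

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = b"storage management"
similar = b"storage managemenT"  # differs from the original in a single bit

print("sampled fingerprints differ in",
      hamming_distance(sampled_fingerprint(original), sampled_fingerprint(similar)),
      "bits")
print("SHA-256 digests differ in",
      hamming_distance(hashlib.sha256(original).digest(),
                       hashlib.sha256(similar).digest()),
      "bits")
```

With sampling, two equal-length inputs can never produce fingerprints that are further apart than the inputs themselves, so similar inputs stay close. A cryptographic hash is designed to destroy exactly that relation.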

When popular hashes such as checksums, CRC, MD5, or SHA are used for the sole purpose of bit-stream differentiation,
• the hash values are larger, and
• the hash computation load is higher
than what is necessary and sufficient for the task of differentiating bit-streams.

Further,
• the ACTUAL collision probabilities are higher than the claimed best-case scenario, since bit-streams generated by computer applications are not equiprobable.

Consequences of using customary hashes

If all bit-streams are random, or their structure is unknown, “balanced hashes” such as CRC, MD5, SHA, etc. have optimal collision probability. Yet, in this case, faster balanced hashes may be used, which reach the same optimal collision probability.

Adapting hash functions to what they’ll hash

Yet, when dealing with computer generated data, the bit-streams are often not random, but have a given structure specific to the creating application. In this case, the mentioned customary hashes have higher collision probabilities than those inferred by the equiprobable assumption.

[Figure: bit-streams ordered from higher probability to lower probability]



Wasting fingerprints on highly unlikely bit-streams…

… fingerprints which would better be used to differentiate highly likely bit-streams.


We would be better off with fingerprints that are adapted to the likelihoods of the bit-streams. These fingerprints would be
• More effective (lower collision probabilities)
• Shorter (lower hash sizes, taking less space)
• More efficient (faster to compute)
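
A small experiment shows how structure in the data can defeat the equiprobable estimate. The sketch assumes a hypothetical application whose records all contain the same four bytes in different orders, and compares a plain additive checksum with a position-weighted sum. Both functions are illustrative stand-ins, not hashes proposed by this presentation.

```python
from itertools import permutations

def byte_sum(data: bytes) -> int:
    """Plain additive checksum: 8 bits, ignores byte order."""
    return sum(data) % 256

def position_weighted_sum(data: bytes) -> int:
    """Checksum in which each byte is weighted by its position."""
    return sum((i + 1) * b for i, b in enumerate(data)) % 65521

# Structured, non-random inputs: every ordering of the same four bytes.
streams = [bytes(p) for p in permutations(b"abcd")]

def distinct_values(hash_function) -> int:
    """How many different hash values the structured inputs receive."""
    return len({hash_function(s) for s in streams})

print(len(streams), "streams")
print("byte_sum distinguishes", distinct_values(byte_sum), "of them")
print("position_weighted_sum distinguishes",
      distinct_values(position_weighted_sum), "of them")
```

For random equiprobable inputs, an 8-bit hash would make two given streams collide about once in 256 pairs. On this structured input the additive checksum makes every pair collide, while a hash that takes the structure into account separates many of them.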


If we were mostly fingerprinting human beings…

An anthropomorphic example


… The above would be a better fingerprinting scheme.

An anthropomorphic example

• Allow duplicate detection applications to
  - need less space to store file fingerprints
  - compute fingerprints faster
  - create smaller candidate duplicate groups
  - reduce time needed to purge file system of duplicates

• Allow synchronization applications to
  - compute fingerprints faster
  - reduce required network load
  - reduce likelihood of not backing up a file that needed to be backed up

Better hashes to…

• Allow bit-stream factoring to
  - avoid data corruption and loss due to collision
  - pin-point most common bit-streams
  - hence reduce storage space throughout file system

• Allow hashed indexing to
  - reflect the semantic relationship of the bit-streams
  - do so in an efficient manner

Better hashes to…

• Produce general hashes for a large class of file types
• Produce optimal hashes for specific common file types
• Design an application that will collect statistics of files in a given file system, file server, or network, and automatically produce hashes that are optimal for the current specific environment
• Design an application that will automatically produce optimal hash functions having specific parameters and functionality
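
As a very rough sketch of the "collect statistics, then choose a hash" idea, the code below measures collisions on a sample of the environment's data and picks the candidate hash with the fewest. The candidate set, the sample records and the selection rule are all hypothetical. A real tool would presumably generate new hash functions rather than choose among existing ones, and would also weigh hash size and speed.

```python
import zlib

def byte_sum(data: bytes) -> int:
    """Plain additive checksum, 8 bits."""
    return sum(data) % 256

# Hypothetical candidate hashes, keyed by name.
CANDIDATES = {"byte_sum": byte_sum, "crc32": zlib.crc32}

def count_collisions(hash_function, samples) -> int:
    """Number of distinct sample bit-streams that fail to get their own hash value."""
    distinct_inputs = set(samples)
    distinct_hashes = {hash_function(s) for s in distinct_inputs}
    return len(distinct_inputs) - len(distinct_hashes)

def pick_hash(samples, candidates=CANDIDATES) -> str:
    """Return the name of the candidate with the fewest collisions on the samples."""
    return min(candidates, key=lambda name: count_collisions(candidates[name], samples))

# Sample of application-generated data: numbered records with a fixed prefix.
samples = [f"record-{i:04d}".encode("ascii") for i in range(100)]
print(pick_hash(samples))
```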

A few implementation ideas…

• Design a system to safely dispatch new hashes to all components of a given protocol scope
• Research new ways to reduce storage requirements by hashing common bit-streams found in the files of a file system
• Produce new indexing hashes for information retrieval and search engines

A few implementation ideas…
