How does NetApp Dedupe work?
When you 'sis on' a volume, the behaviour of that volume changes, and the change takes place in
two phases:
TWO PHASE PROCESS:
PHASE 1 -> SIS enabled: Pre-process: Before the block is written to the array: Collecting Fingerprint
Note: This is true for new blocks only. For existing data blocks that were written before SIS was enabled, you need to run a scan on the
existing data to pull their fingerprints into the catalogue.
PHASE 2 -> SIS Start : Post-process: After the block is written to the array: Sorting, Comparing and
deduping
PHASE 1:
The moment SIS is enabled:
Every time SIS notices a block write request coming in, the sis process makes a call to Data
ONTAP to get a copy of the fingerprint for that block so that it can store this fingerprint in its
catalogue file.
Note: This request interrupts the write stream and results in a 7% performance penalty for all writes
into any volume with SIS enabled.
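The Phase 1 behaviour described above can be sketched as a toy model. This is only an illustration under stated assumptions: ToyVolume, write() and scan_existing() are hypothetical names, not Data ONTAP interfaces, the 4 KB block size is assumed, and zlib.crc32 merely stands in for the checksum that the filer itself produces.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


class ToyVolume:
    """Hypothetical in-memory stand-in for a volume; not a NetApp API."""

    def __init__(self):
        self.blocks = []          # index = block number, value = block data
        self.catalogue = []       # (fingerprint, block number) records
        self.sis_enabled = False

    def _fingerprint(self, data):
        # Stand-in checksum; the real per-block checksum comes from the filer.
        return zlib.crc32(data)

    def write(self, data):
        """Store a block; record its fingerprint only if SIS is enabled."""
        self.blocks.append(data)
        block_no = len(self.blocks) - 1
        if self.sis_enabled:
            self.catalogue.append((self._fingerprint(data), block_no))
        return block_no

    def scan_existing(self):
        """Fingerprint blocks that were written before SIS was enabled."""
        known = {block_no for _, block_no in self.catalogue}
        for block_no, data in enumerate(self.blocks):
            if block_no not in known:
                self.catalogue.append((self._fingerprint(data), block_no))


vol = ToyVolume()
vol.write(b"A" * BLOCK_SIZE)   # written before SIS: no fingerprint recorded
vol.sis_enabled = True         # analogous to enabling SIS on the volume
vol.write(b"B" * BLOCK_SIZE)   # fingerprint collected at write time
entries_before_scan = len(vol.catalogue)
vol.scan_existing()            # pulls in the fingerprint of the older block
entries_after_scan = len(vol.catalogue)
print(entries_before_scan, entries_after_scan)
```

Note that no block is shared or freed in this phase; the model only grows the catalogue, which is what Phase 2 later works from.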
PHASE 2:
Now, at some point you'll want to dedupe the volume by running the 'sis start' command, either manually,
automatically, or via a schedule:
SIS goes through the process of comparing fingerprints from the fingerprint
catalogue file, validating data, and dedupe'ing blocks that pass the validation phase.
Note: In the end, all we are really doing is adjusting some inode metadata to say "hey, remember that
data that used to be here? Well, it's over there now."
IMPORTANT: Nothing about the basic data structure of the WAFL file system has changed, except
you are traversing a different path in the file structure to get to your desired data block. That’s why
NetApp dedupe *usually* has no perceivable impact on read performance - all we've done is
redirect some block pointers. Accessing your data might go a little faster, a little slower, or more
likely not change at all - it all depends on the pattern of the file system data structure and the
pattern of requests coming from the application.
What is a fingerprint?
A fingerprint is a small digital representation of a larger data object. Basically, it is a checksum
value generated by WAFL for each block for the purpose of consistency checking (this generally
involves the creation of a hash).
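As a rough illustration of the idea, the sketch below computes a short checksum over a whole block. It assumes a 4 KB block and uses zlib.crc32 purely as a stand-in; the actual checksum algorithm used by WAFL is not specified here, and fingerprint() is a hypothetical helper name.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def fingerprint(block: bytes) -> int:
    """Return a small checksum standing in for the per-block fingerprint."""
    return zlib.crc32(block)


block_a = b"x" * BLOCK_SIZE
block_b = b"x" * BLOCK_SIZE   # same content as block_a
block_c = b"y" * BLOCK_SIZE   # different content

# Identical blocks always produce identical fingerprints.
same = fingerprint(block_a) == fingerprint(block_b)
# These two particular blocks produce different fingerprints.
different = fingerprint(block_a) != fingerprint(block_c)
# The fingerprint fits in 32 bits, far smaller than the block it represents.
fits_in_32_bits = fingerprint(block_a).bit_length() <= 32
print(same, different, fits_in_32_bits)
```

Because a fingerprint is much smaller than the block, two different blocks can in principle share one, so a fingerprint match alone is only a hint that the blocks may be identical.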
Is fingerprint generated by SIS?
No. Each time a WAFL block is created, a checksum value is generated for the purpose of
consistency checking. NetApp Deduplication (SIS) simply "borrows" a copy of this checksum and
stores it in a catalogue as a fingerprint.
What happens during post-process dedupe?
A. The fingerprint catalogue is sorted and searched for identical fingerprints.
B. When a fingerprint "match" is made, the associated data blocks are retrieved and compared
byte-for-byte.
C. Assuming successful validation, the inode pointer metadata of the duplicate block is redirected to
the original block.
D. The duplicate block is marked as "Free" and returned to the system, eligible for re-use.
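Steps A to D can be sketched as follows. This is a simplified model, not NetApp's implementation: post_process_dedupe(), the blocks dictionary and the pointers dictionary are hypothetical stand-ins for the catalogue, the data blocks and the inode pointer metadata, and zlib.crc32 stands in for the real fingerprint.

```python
import zlib
from itertools import groupby


def fingerprint(data):
    # Stand-in checksum; not the checksum WAFL actually uses.
    return zlib.crc32(data)


def post_process_dedupe(blocks, pointers):
    """Dedupe blocks in place and return the list of freed block numbers.

    blocks:   dict mapping block number -> block data (bytes)
    pointers: dict mapping a file location -> block number
    """
    # A. Build and sort the catalogue so identical fingerprints are adjacent.
    catalogue = sorted((fingerprint(d), n) for n, d in blocks.items())
    freed = []
    for _, group in groupby(catalogue, key=lambda rec: rec[0]):
        kept = []  # blocks in this fingerprint group that stay allocated
        for _, block_no in group:
            # B. Validate a fingerprint match with a byte-for-byte comparison.
            original = next(
                (k for k in kept if blocks[k] == blocks[block_no]), None
            )
            if original is None:
                kept.append(block_no)  # first copy, or a false match
                continue
            # C. Redirect every pointer from the duplicate to the original.
            for location, target in list(pointers.items()):
                if target == block_no:
                    pointers[location] = original
            # D. Free the duplicate block.
            del blocks[block_no]
            freed.append(block_no)
    return freed


blocks = {0: b"A" * 4096, 1: b"B" * 4096, 2: b"A" * 4096}
pointers = {("file1", 0): 0, ("file1", 1): 1, ("file2", 0): 2}
freed = post_process_dedupe(blocks, pointers)
print(freed, pointers[("file2", 0)], sorted(blocks))
```

In this model, file2 ends up pointing at block 0 and block 2 is released. The byte-for-byte check in step B is what keeps two different blocks that happen to share a fingerprint from being merged.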
When to use QSM vs. VSM on dedupe volumes?
Use QSM (Qtree SnapMirror) when you only want to dedupe the destination volume, and use VSM
(Volume SnapMirror) when you want to dedupe both the source and destination volumes automatically,
and save bandwidth during SnapMirror transfers.
Courtesy: Dr. Dedupe, NetApp.
Prepared by: