Besides source and target deduplication, there are also post-process and inline deduplication, as well as fixed- and variable-block-length deduplication. What are the pros and cons of these different approaches?
Each of these approaches has its own advantages and disadvantages. Post-process deduplication requires a larger back-end storage pool than inline deduplication, because data is written to disk in full before it is deduplicated, but it also lets you choose to deduplicate certain workloads and not others. In addition, post-process deduplication lets you rapidly recover the most recent backup set without rehydrating it. Rehydration usually slows recoveries to about 80% of backup speed.
A similar tradeoff exists for block lengths: algorithms that use variable-block-length deduplication are usually slower and produce more metadata, but they achieve better data reduction ratios than fixed-block-length algorithms, which are less compute-intensive.
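
To make the fixed-block side of that tradeoff concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the 8-byte block size, the SHA-256 fingerprints and the helper names `fixed_blocks` and `shared_blocks` are hypothetical and are not taken from any particular product. It shows why fixed blocks are cheap to compute but lose their matches when data shifts.

```python
import hashlib

def fixed_blocks(data: bytes, size: int = 8) -> list:
    """Cut data into fixed-length blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def shared_blocks(old: bytes, new: bytes, size: int = 8) -> int:
    """Count blocks of `new` whose fingerprint already exists in `old`."""
    seen = {hashlib.sha256(b).digest() for b in fixed_blocks(old, size)}
    return sum(hashlib.sha256(b).digest() in seen for b in fixed_blocks(new, size))

original = bytes(range(64))       # eight distinct 8-byte blocks
appended = original + b"tail"     # change at the end: block alignment is kept
inserted = b"X" + original        # one byte inserted at the front: every block shifts

print(shared_blocks(original, appended))  # 8
print(shared_blocks(original, inserted))  # 0
```

Appending data leaves every existing block boundary in place, so all eight original blocks are found again. A single inserted byte shifts every boundary, and no block matches. Variable-block algorithms spend extra compute and metadata to avoid exactly this case.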
A lesser-known third type of block hashing, called sliding-window, is also picking up steam. It can intelligently hash data into different block sizes, depending on the application type, and it tolerates inserts, changes and embedded metadata better than other types of hashing algorithms.
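
One common way to build a sliding-window scheme is to run a rolling hash over the last few bytes and cut a block wherever the hash matches a chosen pattern. The Python sketch below assumes that approach. The window size, the polynomial hash constants, the boundary mask and the function name `sliding_window_chunks` are all illustrative choices, not a description of how any specific vendor implements it, and the sketch does not model application-aware block sizing.

```python
import hashlib
import random

WINDOW = 16           # bytes covered by the sliding window (assumed value)
BASE = 257            # polynomial base for the rolling hash (assumed value)
MOD = (1 << 31) - 1   # modulus for the rolling hash (assumed value)
MASK = 0x3F           # cut when the low 6 bits are all set (~64-byte average)

def sliding_window_chunks(data: bytes) -> list:
    """Split data where the hash of the last WINDOW bytes matches MASK."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks

rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(4096))

old = sliding_window_chunks(data)
new = sliding_window_chunks(b"X" + data)  # one byte inserted at the front

known = {hashlib.sha256(c).digest() for c in old}
shared = sum(hashlib.sha256(c).digest() in known for c in new)
print(len(old), len(new), shared)
```

Because each cut point depends only on the bytes inside the window, an insert disturbs only the chunk around it. The boundaries re-synchronize afterward, so every chunk after the first is found again in the modified stream.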
This was first published in October 2012