Choosing between appliances that perform inline or post-processing data deduplication can be difficult, and the answer as to which method is best for your environment is often "it depends." To help you decide between the two, weigh the following factors.
Data backup time
If minimizing backup times is your primary objective, then a post-processing deduplication appliance is almost always the best approach. With post-processing, the backup data is first stored to disk in its native backup format and then deduplicated after the backup is complete. Conversely, the overhead required to deduplicate data inline can create a bottleneck because, before it can store the data, an inline data deduplication appliance must first break the incoming backup stream into smaller chunks and then compare these chunks to data that has already been deduplicated.
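The chunk-and-compare step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular appliance works: the fixed 4 KB chunk size, the SHA-256 fingerprints, the in-memory dictionary used as the chunk store, and the function names `dedupe_inline` and `restore` are all hypothetical choices made for the example.

```python
import hashlib

# Hypothetical fixed chunk size chosen for illustration only.
CHUNK_SIZE = 4096


def dedupe_inline(stream, store, chunk_size=CHUNK_SIZE):
    """Split a backup stream into chunks and store only unseen chunks.

    `store` maps a chunk fingerprint to the chunk bytes (an in-memory
    stand-in for the appliance's disk). Returns the ordered list of
    fingerprints needed to rebuild the stream.
    """
    recipe = []
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        # The comparison against already-deduplicated data happens
        # here, before anything is written. This is the inline overhead.
        if fingerprint not in store:
            store[fingerprint] = chunk
        recipe.append(fingerprint)
    return recipe


def restore(recipe, store):
    """Rebuild the original stream from its list of fingerprints."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


# Three chunks arrive, but only two distinct chunks are stored.
example_store = {}
example_stream = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE + b"A" * CHUNK_SIZE
example_recipe = dedupe_inline(example_stream, example_store)
print(len(example_recipe), len(example_store))  # prints "3 2"
```

Every chunk must be fingerprinted and looked up before the write completes, which is why the work sits directly in the backup path. A post-processing appliance would run the same loop later, against data already on disk.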
Quantity of redundant data
This is a critical piece of information that your company should try to ascertain ahead of time. Most businesses have large quantities of redundant data that changes little from day to day. If you suspect, or better yet can document, that this is the case with your company, it can help alleviate performance concerns around inline deduplication. A post-processing appliance is oblivious to data in the incoming backup stream; an inline appliance recognizes redundant data in different backup streams and can deduplicate data more efficiently.
A post-processing deduplication appliance needs to maintain a disk cache that's large enough to store the largest night's backup plus enough additional capacity to store the deduplicated data. Because inline deduplication appliances immediately deduplicate data, they don't have a requirement for this additional disk cache and won't need as much disk capacity.
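The capacity difference comes down to simple arithmetic, sketched below. The function names, the figures in the example, and the idea of estimating the deduplicated store from an assumed reduction ratio are all illustrative assumptions. Actual sizing depends on the vendor and on how redundant your data really is.

```python
def dedupe_store_tb(retained_logical_tb, dedupe_ratio):
    """Disk needed for the deduplicated data, given an assumed ratio."""
    if dedupe_ratio <= 0:
        raise ValueError("dedupe_ratio must be positive")
    return retained_logical_tb / dedupe_ratio


def post_processing_capacity_tb(largest_nightly_backup_tb,
                                retained_logical_tb, dedupe_ratio):
    """Landing cache for the largest night plus the deduplicated store."""
    return largest_nightly_backup_tb + dedupe_store_tb(
        retained_logical_tb, dedupe_ratio)


def inline_capacity_tb(retained_logical_tb, dedupe_ratio):
    """Inline appliances need only the deduplicated store."""
    return dedupe_store_tb(retained_logical_tb, dedupe_ratio)


# Hypothetical environment: 10 TB largest nightly backup, 300 TB of
# retained backups, and an assumed 10:1 reduction ratio.
print(post_processing_capacity_tb(10, 300, 10))  # prints 40.0
print(inline_capacity_tb(300, 10))               # prints 30.0
```

In this made-up case the post-processing appliance needs 10 TB more disk, which is exactly the size of the largest nightly backup it has to land before deduplicating.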
Offsite replication requirements
If you need to quickly replicate deduplicated data to an offsite location, such as a disaster recovery (DR) site, you should give preference to an inline deduplication appliance. Even though an inline backup will likely take longer to complete than one sent to a post-processing appliance, the post-processing appliance requires a window of time to deduplicate the data after the backup is complete. The backup window plus the deduplication window may be longer than the time it takes to deduplicate all of the data inline, so with an inline deduplication appliance, the process of sending the deduplicated data over the WAN can start sooner.
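The timing argument above can be checked with a rough calculation. This sketch assumes the simple model in the paragraph: a post-processing appliance deduplicates only after the backup finishes, and replication starts only once deduplication is done. The function name and all hour values are hypothetical. Substitute measurements from your own environment.

```python
def hours_until_replication_ready(backup_hours, dedupe_hours=0.0):
    """Hours from backup start until deduplicated data can be replicated.

    For an inline appliance, deduplication is part of the backup, so
    leave `dedupe_hours` at 0. For post-processing, pass the separate
    deduplication window that follows the backup.
    """
    if backup_hours < 0 or dedupe_hours < 0:
        raise ValueError("durations must be non-negative")
    return backup_hours + dedupe_hours


# Hypothetical figures: the inline backup is slower (8 hours versus
# 6 hours), but post-processing adds a 4-hour deduplication window.
inline_ready = hours_until_replication_ready(8.0)
post_ready = hours_until_replication_ready(6.0, 4.0)
print(inline_ready, post_ready)  # prints "8.0 10.0"
```

With these invented numbers the inline appliance can start replicating two hours sooner, despite the longer backup. If the post-processing deduplication window were shorter than the extra inline backup time, the result would reverse.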
Copying backup data from disk to tape
If copying data from disk to tape is going to remain a part of a company's data protection strategy, then the company needs to establish what set of data it plans to copy from disk to tape. If the copy is going to occur immediately (zero to 12 hours after the backup completes), then a post-processing appliance has an edge because it doesn't need to reconstruct the deduplicated data before copying it to tape. However, if copying older backup data (day-old or week-old) to tape is the objective, then neither approach necessarily has an advantage.
Multiple clustered server nodes
Clustering servers to provide the processing power and memory necessary to deduplicate data is important for a company that anticipates backing up and deduplicating tens of terabytes or more every night. If a company has fewer than 20 TB of data to back up, either an inline or post-processing approach will generally work. However, once a company scales beyond 20 TB of backup data, it needs to carefully examine the deduplication appliance's architecture and whether that architecture can scale to deduplicate the data in its environment. This matters more in inline deduplication architectures because the processing power and memory an inline appliance can allocate to deduplication affect how quickly backups complete.
About the author: Jerome M. Wendt is lead analyst and president of DCIG Inc.
This was first published in May 2008