By W. Curtis Preston
It's hard to overemphasize the importance of data deduplication in today's backup systems. Data dedupe is perhaps the biggest game changer since the introduction of network backup systems 15 years ago, and its popularity can be traced to a number of factors. First, data deduplication enables users to increase disk utilization in their backup system. Tape had always been significantly cheaper than disk as a target for backups, and while the cost of backup to disk has decreased significantly in the last several years, so has the cost of tape. As a result, disk was typically used just as a staging mechanism for tape, rather than for long-term backup or archive storage.
Data dedupe changed that forever. The random-access capabilities of disk allow data deduplication systems to remove redundant segments of data and replace them with pointers without significantly affecting restore performance. (While there's some performance degradation, restores are still much faster than when using tape.)
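To make the idea of replacing redundant segments with pointers concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's implementation: the tiny fixed chunk size, the in-memory dictionary used as the chunk store, and the function names `dedupe` and `restore` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only


def dedupe(data: bytes, store: dict) -> list:
    """Store each unique chunk once; return the list of pointers (hashes)."""
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        key = hashlib.sha1(chunk).hexdigest()
        store.setdefault(key, chunk)  # keep only the first copy of a chunk
        pointers.append(key)
    return pointers


def restore(pointers: list, store: dict) -> bytes:
    """Rebuild the original data by following each pointer in order."""
    return b"".join(store[key] for key in pointers)


store = {}
backup = b"AAAABBBBAAAAAAAACCCC"  # five chunks, three of them unique
pointers = dedupe(backup, store)
restored = restore(pointers, store)
```

A restore has to look up chunks in whatever order the pointers dictate, which is why this approach suits random-access disk far better than sequential tape.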
Despite data dedupe's indisputable benefits, a lot of users waited to see if the techniques employed in target dedupe devices would eventually make their way into backup software, making such special-purpose appliances unnecessary. Most experts don't believe target deduplication appliances have become unnecessary, but data deduplication has, indeed, made its way into mainstream backup software products.
EMC Corp. and Symantec Corp. were the first major backup software companies to integrate deduplication into their product lines, and both did it through acquisition. EMC acquired Avamar Technologies, and Symantec's PureDisk product line resulted from its acquisition of Datacenter Technologies. CommVault and IBM Corp. chose to "roll their own" deduplication products.
Source deduplication and backup to disk
EMC and Symantec both offer source deduplication products. That is, you can install the Avamar or PureDisk agent on a computer and the client will communicate with the backup server to identify and eliminate redundant data before it's transferred across the network. Only new bytes are sent with each backup, which makes source deduplication perfect for smaller remote offices and mobile data.
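The exchange described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Avamar or PureDisk protocol: the `BackupServer` class, its `missing` and `upload` methods, the `source_backup` function, and the 8-byte fixed chunk size are all hypothetical names and values chosen for the illustration.

```python
import hashlib


class BackupServer:
    """Hypothetical server that keeps one copy of each chunk, keyed by hash."""

    def __init__(self):
        self.chunks = {}

    def missing(self, hashes):
        """Return the subset of hashes the server has not stored yet."""
        return {h for h in hashes if h not in self.chunks}

    def upload(self, new_chunks):
        self.chunks.update(new_chunks)


def source_backup(data: bytes, server: BackupServer, chunk_size: int = 8) -> int:
    """Back up data; return how many chunk bytes crossed the 'network'."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    by_hash = {hashlib.sha1(c).hexdigest(): c for c in chunks}
    wanted = server.missing(by_hash.keys())       # client asks before sending
    to_send = {h: by_hash[h] for h in wanted}     # only chunks the server lacks
    server.upload(to_send)
    return sum(len(c) for c in to_send.values())


server = BackupServer()
monday = b"payroll-" * 4 + b"invoice1"   # 40 bytes, two unique chunks
tuesday = monday + b"invoice2"           # 48 bytes, one new chunk
sent_monday = source_backup(monday, server)
sent_tuesday = source_backup(tuesday, server)
```

In this toy run the second backup is larger than the first, yet it sends only the single chunk the server has never seen.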
Both vendors offer their source deduplication products as standalone products, which means you don't have to purchase Symantec's NetBackup or EMC's NetWorker. So even if you aren't using Symantec or EMC backup apps, you can take advantage of their deduplication technology. But until recently, if you wanted the functionality of both the backup app and dedupe, you had to purchase and manage two products (i.e., NetBackup and PureDisk, or NetWorker and Avamar). Symantec is the first to change this with NetBackup 7, which has built-in source dedupe that doesn't require a separate PureDisk installation. EMC lets you manage Avamar via NetWorker, and a single install of EMC's client software supports both NetWorker and Avamar backups, but Avamar still requires a separate server to back up to.
Target deduplication is also available from backup software vendors. Symantec was the first to do this by allowing NetBackup customers to send standard NetBackup backups to a media server where they would be deduplicated by PureDisk. (With NetBackup 7, this functionality is available without requiring a separate PureDisk installation.)
IBM entered the data deduplication space with the introduction of its post-process target deduplication feature in Tivoli Storage Manager (TSM) 6.1. TSM can natively deduplicate its backups stored on disk after they have completed. IBM's target deduplication offering is unique in that it's included in the base product; however, the deduplication ratios it achieves may be relatively modest compared with those of the extra-cost deduplication options in other products.
CommVault's Simpana deduplication facility is difficult to categorize as target or source dedupe. Deduplication in backup software requires multiple steps: (1) slicing files to be backed up into segments or "chunks"; (2) creating a "hash" value (typically using SHA-1); (3) doing a hash table lookup to see if the value is unique; and (4) deciding whether or not to send the chunk to storage. Source deduplication products perform all four steps on the client; target deduplication appliances do all four at the target or backup server. With CommVault's approach, however, steps one and two are done at the client while steps three and four are done at the backup server (media agent in CommVault lingo). This is why it's difficult to classify the dedupe as source or target.
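The four steps, and the split between client and backup server described above, can be sketched as follows. This is a rough model only, not CommVault's code: the function `client_chunk_and_hash`, the `MediaAgent` class and its `ingest` method, and the fixed 8-byte chunk size are assumptions invented for the example.

```python
import hashlib
from typing import Iterator, Tuple


def client_chunk_and_hash(data: bytes, size: int = 8) -> Iterator[Tuple[str, bytes]]:
    """Steps 1 and 2, run on the client: slice into chunks, fingerprint each."""
    for i in range(0, len(data), size):
        chunk = data[i:i + size]
        yield hashlib.sha1(chunk).hexdigest(), chunk


class MediaAgent:
    """Steps 3 and 4, run on the backup server (hypothetical class)."""

    def __init__(self):
        self.index = {}      # hash table: fingerprint -> stored chunk
        self.received = 0    # bytes that arrived over the network
        self.written = 0     # bytes actually written to storage

    def ingest(self, fingerprint: str, chunk: bytes) -> None:
        self.received += len(chunk)
        if fingerprint not in self.index:    # step 3: hash table lookup
            self.index[fingerprint] = chunk  # step 4: store only if unique
            self.written += len(chunk)


agent = MediaAgent()
for fingerprint, chunk in client_chunk_and_hash(b"ABCDEFGH" * 3 + b"12345678"):
    agent.ingest(fingerprint, chunk)
```

Here every chunk still travels to the media agent (32 bytes received) even though only the unique ones are written (16 bytes), which is the behavior that distinguishes this arrangement from source deduplication.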
But if the real distinction between the two categories is whether or not the original, native data ever leaves the client, then CommVault Simpana is best placed in the target deduplication category. Still, Simpana's unique practice of doing the first two steps on the client allows it to do something other target products can't do: client-side compression. Most target dedupe systems won't deduplicate your data well if you compress it at the client before sending it to the target because compression inhibits the deduplication system's ability to correctly chunk and fingerprint the data to identify duplicates. But because Simpana chunks and fingerprints the data at the client, it can compress it before sending it across the network with no negative effects. The compression doesn't save as much bandwidth as source deduplication, but it can be advantageous in some environments.
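A short Python sketch shows why fingerprinting before compression keeps duplicates detectable. It is a simplified illustration, not Simpana's implementation: the function names `client_prepare` and `media_agent_store`, the 1,024-byte fixed chunk size, and the use of zlib as the compressor are assumptions made for the example.

```python
import hashlib
import zlib

CHUNK = 1024  # illustrative fixed chunk size


def client_prepare(data: bytes):
    """Chunk and fingerprint the raw data first, then compress each chunk."""
    for i in range(0, len(data), CHUNK):
        raw = data[i:i + CHUNK]
        fingerprint = hashlib.sha1(raw).hexdigest()  # computed before compression
        yield fingerprint, zlib.compress(raw)


def media_agent_store(stream, index: dict) -> int:
    """Dedupe on the fingerprint; return the bytes received over the wire."""
    wire_bytes = 0
    for fingerprint, payload in stream:
        wire_bytes += len(payload)
        index.setdefault(fingerprint, payload)  # duplicate chunks are discarded
    return wire_bytes


block_a = b"0123456789abcdef" * 64   # exactly one 1,024-byte chunk
block_b = b"fedcba9876543210" * 64
data = block_a * 3 + block_b         # four chunks, two of them unique

index = {}
wire_bytes = media_agent_store(client_prepare(data), index)
restored_a = zlib.decompress(index[hashlib.sha1(block_a).hexdigest()])
```

Because each fingerprint is taken from the uncompressed chunk, the server still recognizes the repeated blocks, while the payload that crosses the network is the smaller compressed form.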
This story was originally published in Storage magazine.