What you will learn in this tip: Data deduplication is a must-have when it comes to modern data backup systems. By removing redundancies, deduplication can save as much as 90% of the space needed.
When choosing a deduplication product, one of the questions you need to answer is how you're going to dedupe your data. There are numerous approaches available, depending on where you want to do the deduplication, and whether you want to do it in hardware or software. As is often the case, there is no single answer that fits all situations. You have to look at your present system, your expected growth, replacement cycle and other factors to decide which approach is best for your business.
Target deduplication removes redundancies from a backup transmission as it passes through an appliance sitting between the source and the backup target. This is in contrast to source deduplication products, which process deduplication on a client server before sending data across a network to the backup target.
Target deduplication appliances represent the high end of modern deduplication. These deduplication appliances sit between your network and your storage, and provide fast, high-capacity deduplication for your data. With prices for the most basic units starting at approximately $7,000 (more commonly around $25,000) and running up to about $250,000, depending on capacity and features, they are not cheap. But with throughputs of anywhere from 950 MBps to 27,500 MBps and the potential for several petabytes of attached storage, they offer high-end capabilities for deduping and managing data.
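To put those throughput figures in perspective, it helps to translate them into a backup window. The short Python sketch below is illustrative only: it assumes MBps means decimal megabytes per second, that the appliance sustains its rated throughput for the whole job, and that the nightly backup is a hypothetical 50 TB. The function name backup_hours is made up for this example.

```python
def backup_hours(data_tb: float, throughput_mbps: float) -> float:
    """Hours needed to ingest data_tb terabytes at throughput_mbps megabytes per second."""
    megabytes = data_tb * 1_000_000  # assumption: decimal units, 1 TB = 1,000,000 MB
    seconds = megabytes / throughput_mbps
    return seconds / 3600


# Hypothetical 50 TB backup at the low and high ends of the range quoted above.
for rate in (950, 27_500):
    print(f"{rate:>6} MBps: {backup_hours(50, rate):.1f} hours for 50 TB")
```

Real-world rates are usually lower than rated ones, so treat the result as a best case rather than a forecast.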
Almost all of these appliances have a lot of storage incorporated -- but not necessarily built into the appliance -- to handle the needs of deduping, particularly offline deduping. In addition, many target deduplication appliances offer features such as redundant fans and hot-pluggable disks to improve reliability.
Features in target deduplication appliances
If you need this kind of high throughput, it's important to pay close attention to the specifications. Most manufacturers achieve the best throughput by using arrays of appliances. However, not all "arrays" are created equal.
One very popular feature in target deduplication appliances is the use of RAID 6 for storage. RAID 6 adds a second set of parity data to a RAID 5-style array, at the cost of one more disk's worth of capacity, so the array can survive a second drive failure while it is being rebuilt. Given the sheer size of storage possible in these deduplication appliances, a second drive failure is a distinct possibility. The need for reliability is also the reason that vendors pay a lot of attention to redundant power supplies, fans and other components.
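The capacity cost of that extra protection is easy to estimate. The sketch below is a simplified illustration, not a sizing tool: it assumes equal-size disks, ignores hot spares and file system overhead, and uses a hypothetical shelf of twelve 2 TB drives. The function name usable_tb is invented for this example.

```python
def usable_tb(disks: int, disk_tb: float, parity_disks: int) -> float:
    """Usable capacity when parity_disks' worth of space is given over to parity."""
    if disks <= parity_disks:
        raise ValueError("array needs more disks than parity overhead")
    return (disks - parity_disks) * disk_tb


# Hypothetical shelf: twelve 2 TB drives.
raid5 = usable_tb(12, 2.0, parity_disks=1)  # tolerates one drive failure
raid6 = usable_tb(12, 2.0, parity_disks=2)  # tolerates two drive failures
print(f"RAID 5: {raid5} TB usable, RAID 6: {raid6} TB usable")
```

In this example the second parity set costs one drive's worth of capacity, which is usually a small price on large arrays where rebuilds take a long time.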
Source vs. target deduplication
Unlike source dedupe, which dedupes the data at the desktop or other source before it is transmitted over the network, target deduplication does its deduplication at the storage end of the process. This allows one device to handle all the deduplication on the network and offers the potential for more efficient deduplication at the expense of increased network traffic.
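The network trade-off can be illustrated with a toy model. The Python sketch below is not how any particular product works: it assumes fixed-size 4 KB chunks identified by SHA-256 fingerprints (many real products use variable-size chunks), and the function names and the in-memory dictionary standing in for the appliance's chunk store are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks, chosen only for illustration


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into consecutive fixed-size chunks."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def target_dedupe(data: bytes, store: dict) -> int:
    """Every chunk crosses the network; the appliance keeps only unseen ones."""
    sent = 0
    for chunk in chunks(data):
        sent += len(chunk)
        store.setdefault(hashlib.sha256(chunk).digest(), chunk)
    return sent


def source_dedupe(data: bytes, store: dict) -> int:
    """The client sends a fingerprint per chunk, plus the chunk if it is new."""
    sent = 0
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).digest()
        sent += len(digest)  # fingerprint is always sent to check for a match
        if digest not in store:
            store[digest] = chunk
            sent += len(chunk)
    return sent


# Highly redundant sample backup: ten chunks, only two of them distinct.
backup = b"A" * CHUNK_SIZE * 9 + b"B" * CHUNK_SIZE
target_store, source_store = {}, {}
target_sent = target_dedupe(backup, target_store)
source_sent = source_dedupe(backup, source_store)
print(f"target: {target_sent} bytes sent, source: {source_sent} bytes sent")
```

Both approaches end up storing the same two unique chunks; the difference in this model is how much data crosses the wire and which machine spends CPU time on hashing.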
In general, source and target deduplication appliances are closely related since they do the same job. Some models, like Symantec Corp.'s NetBackup 5000 series, can serve either as source or target deduplicators, which makes them especially useful in remote offices where data is deduplicated before being sent over a WAN.
In deduplication, you have a choice between inline deduplication (deduplicating the data as it comes in) and offline dedupe (where the data is stored in the appliance and deduplicated later). While the compute-intensive nature of deduplicating imposes penalties on throughput, most users report little or no practical difference. Some models offer a choice of inline and offline deduplication in the same device.
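The difference between the two modes shows up mainly in how much raw data has to land on disk first. The following Python sketch is a simplified model under stated assumptions: chunks are identified by SHA-256 fingerprints, the chunk store is an in-memory dictionary, and the function names are hypothetical rather than taken from any product.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """Identify a chunk by its SHA-256 digest (an assumption for this sketch)."""
    return hashlib.sha256(chunk).digest()


def inline_ingest(incoming):
    """Dedupe each chunk on arrival; only unique chunks are ever written."""
    store, written_bytes = {}, 0
    for chunk in incoming:
        key = fingerprint(chunk)  # hashing happens in the ingest path
        if key not in store:
            store[key] = chunk
            written_bytes += len(chunk)
    return store, written_bytes


def offline_ingest(incoming):
    """Land the raw backup first, then dedupe it in a later pass."""
    staging = list(incoming)  # full-size copy sits on disk
    staged_bytes = sum(len(c) for c in staging)
    store = {}
    for chunk in staging:  # hashing happens after ingest has finished
        store.setdefault(fingerprint(chunk), chunk)
    return store, staged_bytes


# Sample stream: ten 1 KB chunks, only two of them distinct.
stream = [b"A" * 1024] * 8 + [b"B" * 1024] * 2
inline_store, inline_bytes = inline_ingest(stream)
offline_store, offline_bytes = offline_ingest(stream)
print(f"inline wrote {inline_bytes} bytes, offline staged {offline_bytes} bytes")
```

Both paths finish with the same deduplicated store. In this model the inline path pays for hashing while data is arriving, and the offline path pays with staging space, which is one reason these appliances carry so much storage.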
Hardware deduplication is only one approach to the problem of deduplication. Most of the major makers of backup software now provide deduplication, either as a built-in option, as in CA's ARCserve or CommVault, or as an add-on application like Symantec's PureDisk.
Software-based deduplication is cheaper, at least in acquisition cost, but it is also markedly slower. How slow depends on the underlying hardware, the throughput of the software and the capacity of the network.
In choosing an approach to deduplication, you need to consider your needs, including expected growth, as well as your budget and the features you desire. Compatibility with your backup software and disk array or VTL is also an important consideration. Finally, be somewhat skeptical of vendors' figures, especially for throughput. Your mileage may vary considerably.
About this author: Rick Cook specializes in writing about issues related to storage and storage management.
This was first published in January 2011