Deduplication involves checking each file or block being backed up and replacing any duplicates with pointers to a single stored copy. Because so much of what is stored in a business environment is redundant (e.g., emails sent to multiple recipients), the result can be a storage-space savings of 25-to-1 or more.
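At the block level, that hash-and-pointer idea can be sketched in a few lines of Python. The fixed 4 KB block size and SHA-256 hashing here are illustrative assumptions, not any particular vendor's proprietary algorithm:

```python
# Minimal sketch of block-level deduplication: split data into
# fixed-size blocks, hash each block, store each unique block once,
# and represent the data as a list of pointers (hashes).
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real products vary


def deduplicate(data: bytes):
    store = {}     # hash -> the single stored copy of that block
    pointers = []  # ordered hashes standing in for the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy
        pointers.append(digest)
    return store, pointers


def reconstruct(store, pointers) -> bytes:
    # Follow the pointers back to the unique blocks.
    return b"".join(store[h] for h in pointers)


# Highly redundant input: 100 blocks but only 2 distinct ones.
data = b"A" * BLOCK_SIZE * 50 + b"B" * BLOCK_SIZE * 50
store, pointers = deduplicate(data)
assert reconstruct(store, pointers) == data
ratio = len(data) / sum(len(b) for b in store.values())  # 50.0 here
```

The more redundant the input, the higher the ratio; the 25:1 figures the text cites come from environments full of repeated attachments and files.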
The other major drawback to data deduplication is the lack of standards. Every major vendor does data deduplication its own way, so deduplicated data can only be reconstructed by equipment from the same vendor whose products deduplicated it in the first place. Since the efficiency of deduplication algorithms -- which are mostly proprietary -- is a major competitive advantage, this isn't likely to change.
If you use a remote system for disaster recovery (DR), you need to make sure devices from the same vendor are available at the DR site to reconstruct the deduplicated data, as well as people trained to use them. Data deduplication therefore also raises the issue of vendor lock-in.
Target-based vs. source-based deduplication
There are two main methods of data deduplication currently in use: target- and source-based data deduplication.
In target-based data deduplication, deduplication is handled by a device such as a virtual tape library (VTL) with data deduplication built in. Your backup software doesn't change, and the data is deduped after it has been sent over the network. This lets you keep the rest of your backup system as it is, especially your backup software, but it does nothing to save bandwidth. One way around the bandwidth issue, which becomes especially important for remote sites communicating over a WAN, is to put a VTL at the remote site, deduplicate the data there and transmit only the deduplicated data to the backup server. If the cost of network capacity is high enough, and the cost of the VTLs is low enough, this can save you money. However, it requires having a data deduplication device at every site to be protected.
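The target-based tradeoff can be sketched as follows: the full stream crosses the network, and only the device on the far end dedupes it. The class and method names here are illustrative stand-ins, not a real VTL API:

```python
# Sketch of target-based deduplication: the backup host sends the
# complete stream, and the target device dedupes on ingest. Disk
# space is saved; network bandwidth is not.
import hashlib

BLOCK_SIZE = 4096  # assumed block size


class DedupTarget:
    """Stands in for a VTL with built-in deduplication."""

    def __init__(self):
        self.store = {}          # hash -> single stored copy
        self.bytes_received = 0  # what actually crossed the network

    def ingest(self, stream: bytes):
        self.bytes_received += len(stream)  # the full stream travels the wire
        for i in range(0, len(stream), BLOCK_SIZE):
            block = stream[i:i + BLOCK_SIZE]
            self.store.setdefault(hashlib.sha256(block).digest(), block)


target = DedupTarget()
target.ingest(b"A" * BLOCK_SIZE * 100)  # 100 identical blocks
stored = sum(len(b) for b in target.store.values())
# bytes_received is 100x larger than what is stored: the savings
# happen only after the data has already consumed bandwidth.
```

Putting the device at the remote site, as the text suggests, moves the `ingest` step to the near side of the WAN so only the deduplicated blocks need to be transmitted.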
In source-based deduplication, the deduplication is done by the backup software. The software on the clients communicates with software on the backup server to check whether each file or block is a duplicate. Duplicates are replaced by pointers before the data is sent to the server.
Source-based deduplication conserves bandwidth without requiring extra hardware at each protected site. However, it requires a lot of extra communication between the server and the clients, since each piece of data (block or file) has to be checked against the server's index of pieces it already holds.
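The client/server exchange can be sketched like this: the client hashes each block, asks the server whether it already holds that hash (the per-block round trip is the extra chatter just described), and sends only new blocks. Names are illustrative, not any product's protocol:

```python
# Sketch of source-based deduplication: the client checks each
# block's hash with the server before sending, so only unique
# data crosses the network.
import hashlib

BLOCK_SIZE = 4096  # assumed block size


class BackupServer:
    def __init__(self):
        self.store = {}  # hash -> single stored copy

    def has(self, digest) -> bool:
        # In a real system this lookup is a network round trip,
        # one per block -- the source of the extra chatter.
        return digest in self.store

    def put(self, digest, block):
        self.store[digest] = block


def source_dedup_backup(client_data: bytes, server: BackupServer) -> int:
    """Back up client_data; return bytes actually sent over the wire."""
    bytes_sent = 0
    for i in range(0, len(client_data), BLOCK_SIZE):
        block = client_data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        if not server.has(digest):       # duplicate? send only a pointer
            server.put(digest, block)
            bytes_sent += len(block)     # unique data crosses the wire
    return bytes_sent


server = BackupServer()
sent = source_dedup_backup(b"A" * BLOCK_SIZE * 100, server)
# Only one unique block is transmitted instead of 100.
```

A second backup of the same data would transmit nothing at all, since every hash is already in the server's index.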
Which system is faster depends very much on the specifics of the installation. If you have a lot of data -- say, multiple terabytes -- target deduplication is usually faster; with smaller amounts of data, other factors, such as network performance, dominate the differences.
About this author: Rick Cook specializes in writing about issues related to storage and storage management.
This was first published in September 2008