
The downsides of data deduplication technology explained

Data deduplication is currently one of the hottest areas of backup. But there are still some areas where data deduplication has some growing to do.

Data deduplication companies like Data Domain Inc., EMC Corp. and ExaGrid Systems are reporting record growth as users come to understand the value proposition of data dedupe and adopt it widely. But that doesn't mean that data deduplication is completely mature or ready for general storage use. There are still some areas, such as the lack of deduplication standards, where data deduplication has some growing to do. Similarly, data deduplication is best adapted to specific kinds of situations, notably data backups.

Deduplication involves checking each file or block being backed up and replacing any duplicates with pointers to a single copy. Because so much of what is stored in a business environment is redundant (e.g., emails to multiple recipients), the result can be a reduction in storage space of 25 to 1 or more.
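The idea of replacing duplicates with pointers can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular vendor does it: the names `dedupe`, `reconstruct` and `BLOCK_SIZE` are hypothetical, blocks are a fixed size, and a SHA-256 digest stands in for the pointer.

```python
import hashlib

BLOCK_SIZE = 4  # tiny fixed block size, chosen only to keep the example readable


def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and return a list of pointers (digests).

    Each unique block is stored once in `store`; a duplicate adds only a pointer.
    """
    pointers = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep the first copy only
        pointers.append(digest)
    return pointers


def reconstruct(pointers: list, store: dict) -> bytes:
    """Rebuild the original data by following every pointer back to its block."""
    return b"".join(store[p] for p in pointers)


store = {}
data = b"ABCDABCDABCDWXYZ"  # four blocks, only two of them unique
pointers = dedupe(data, store)
restored = reconstruct(pointers, store)
```

Here the four blocks of `data` collapse to two stored blocks plus four pointers. Note that `reconstruct` has to look up every pointer to rebuild the file, which is where the read-side cost of deduplication comes from.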

So why wouldn't you use data deduplication for all kinds of storage? The reason is that data deduplication exacts a performance and resource penalty, especially when reconstructing a file. Generally the cost in speed and resources is high enough that data deduplication isn't attractive for ordinary storage. Backups are different because data is typically written once and read infrequently. The penalties associated with data deduplication are much smaller for a properly designed backup system.

The other major drawback to data deduplication is the lack of standards. Every major vendor does data deduplication its own way, and as a result files have to be reconstructed by equipment from the same vendor whose products were used to deduplicate the data in the first place. Since the efficiency of deduplication algorithms, which are mostly proprietary, is a major competitive advantage, this isn't likely to change.

If you use a remote system for disaster recovery (DR), you need to make sure the devices needed to reconstruct the deduplicated data are available at the DR site, as well as the people trained to use them. Data deduplication also raises the issue of vendor lock-in.

Target-based vs. source-based deduplication

There are two main methods of data deduplication currently in use: target- and source-based data deduplication.

In target-based data deduplication, deduplication is handled by a device such as a virtual tape library (VTL), which has data deduplication built in. Your backup software doesn't change and the data is deduped after it has been sent over the network. This lets you continue to use the balance of your backup system as it already exists, especially your backup software, but it doesn't do anything to save bandwidth. One way around the bandwidth issue, which becomes especially important to remote sites communicating over a WAN, is to have a VTL at the remote site, deduplicate the data there and transmit it to the backup server. If the cost of network capacity is high enough, and the cost of the VTLs is low enough, this can save you money. However, it requires having a data deduplication device at every site to be protected.

In source-based deduplication, the deduplication is done by the backup software. The software on the clients communicates with software on the backup server to check each file or block to see if it is a duplicate. Duplicates are replaced by pointers before the data is sent to the server.

Source-based deduplication conserves bandwidth without the need to have extra hardware at the source. However, it requires a lot of extra communication between the server and the clients since each piece of data (block or file) has to be checked against the server's list of already-present pieces of data.
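The tradeoff between the two methods can be made concrete with a toy simulation. The sketch below is a simplification under stated assumptions: `BackupServer`, `source_backup` and `target_backup` are hypothetical names, blocks are a fixed four bytes, the client asks about one block at a time, and only payload bytes are counted as network traffic. Commercial products are proprietary and will differ in the details.

```python
import hashlib


class BackupServer:
    """Hypothetical server holding unique blocks keyed by SHA-256 digest."""

    def __init__(self):
        self.blocks = {}
        self.bytes_received = 0  # payload bytes only; digests are not counted
        self.lookups = 0         # number of "do you already have this?" queries

    def has_block(self, digest):
        self.lookups += 1
        return digest in self.blocks

    def put_block(self, digest, block):
        self.bytes_received += len(block)
        self.blocks[digest] = block


def source_backup(data, server, block_size=4):
    """Source-based: the client checks each block and sends only new ones."""
    pointers = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if not server.has_block(digest):
            server.put_block(digest, block)
        pointers.append(digest)
    return pointers


def target_backup(data, server, block_size=4):
    """Target-based: all data crosses the network, then the device dedupes it."""
    server.bytes_received += len(data)
    pointers = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        server.blocks.setdefault(digest, block)
        pointers.append(digest)
    return pointers


data = b"ABCDABCDABCDWXYZ"  # four blocks, only two of them unique
src, tgt = BackupServer(), BackupServer()
source_backup(data, src)
target_backup(data, tgt)
```

Both servers end up storing the same two unique blocks. The source-based run sends 8 payload bytes but makes four lookups, while the target-based run sends all 16 bytes and makes none, which mirrors the bandwidth-versus-chatter tradeoff described above.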

Which system is faster depends very much on the specifics of the installation. If you have a lot of data, say multiple terabytes, target deduplication is usually faster, but if you have smaller amounts of data, other factors such as network performance overwhelm the differences.

About this author: Rick Cook specializes in writing about issues related to storage and storage management.
