

Dedupe explained: A primer on backup deduplication in 2014

Backup deduplication is so common today, it is often taken for granted. However, there are a number of things to consider before choosing a dedupe approach.

Over the last few years, backup deduplication has gone from being a "nice to have" feature to becoming little more than a checklist item for backup products. In fact, deduplication has become so commonplace that there have been documented situations in which organizations were taking advantage of deduplication without even realizing it.

Given how commonplace deduplication has become, it is worth considering whether there are any legitimate reasons to avoid using it. In order to answer that question, it is necessary to first take a look at how backup deduplication works.

Deduplication architecture

The first thing you must understand about backup deduplication is that it comes in many different forms. Deduplication can be performed at the hardware or the software level, or it can be performed using some combination of the two. Similarly, deduplication can be performed at the data source, on a backup target or both.

Source-side deduplication is helpful in situations where data needs to be transmitted across a slow link. Deduplicating the data at its source shrinks it prior to transmission, so the backup completes more quickly than it would if the full, non-deduplicated data set had to cross the link.
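To make the idea concrete, here is a minimal sketch of how a source might decide what to send. It assumes fixed-size chunks fingerprinted with SHA-256; the function names, the 4 KB chunk size and the sample data are illustrative assumptions, not the method of any particular backup product.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (fingerprint, chunk) pairs for fixed-size chunks of data."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


def plan_transfer(data: bytes, known_to_target: set):
    """Return (recipe, payload).

    recipe lists every chunk fingerprint in order; payload holds only the
    chunks that the target does not already have.
    """
    recipe = []
    payload = {}
    for fp, chunk in chunk_fingerprints(data):
        recipe.append(fp)
        if fp not in known_to_target and fp not in payload:
            payload[fp] = chunk
    return recipe, payload


# Hypothetical backup: 100 chunks, but only two distinct chunk contents.
backup = (b"A" * CHUNK_SIZE) * 60 + (b"B" * CHUNK_SIZE) * 40
recipe, payload = plan_transfer(backup, known_to_target=set())
bytes_sent = sum(len(chunk) for chunk in payload.values())
print(f"logical size: {len(backup)} bytes, sent over the link: {bytes_sent} bytes")
```

Only the unique chunks travel across the link. The recipe of fingerprints is enough for the target to reassemble the original data.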

Target deduplication occurs at the backup target or on a remote storage device. The primary goal behind target deduplication is to reduce storage costs. Target deduplication shrinks the data, thereby consuming far less physical storage than might otherwise be required.

Sometimes, source and target deduplication are used together. Deduplicating at the source reduces the amount of data sent to the target. However, if there are multiple data sources, redundancy may also exist across those sources, and source-side deduplication alone may not catch it. The target deduplication process eliminates any cross-source redundancy, further reducing backup storage costs.
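Cross-source redundancy is easier to see with a toy example. The sketch below assumes a hypothetical TargetStore class that keeps one copy of each unique chunk, and two made-up servers that share the same operating system image. It illustrates the principle only and does not describe a real product's design.

```python
import hashlib


class TargetStore:
    """Toy backup target that keeps one copy of each unique chunk."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}    # fingerprint -> chunk bytes
        self.backups = {}   # source name -> ordered list of fingerprints

    def ingest(self, source: str, data: bytes) -> int:
        """Store a backup and return the number of new bytes written."""
        written = 0
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                self.chunks[fp] = chunk
                written += len(chunk)
            recipe.append(fp)
        self.backups[source] = recipe
        return written

    def restore(self, source: str) -> bytes:
        """Rebuild a backup from its recipe of fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.backups[source])


# Hypothetical data: ten distinct chunks shared by both servers,
# plus one chunk that is unique to each server.
os_image = b"".join(bytes([i]) * 4096 for i in range(10))
data_a = os_image + b"a" * 4096
data_b = os_image + b"b" * 4096

store = TargetStore()
written_a = store.ingest("server-a", data_a)
written_b = store.ingest("server-b", data_b)
print(f"server-a wrote {written_a} bytes, server-b wrote {written_b} bytes")
```

The second server's backup adds only its one unique chunk, because the shared image is already on the target.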

It is also important to understand that backup deduplication can occur inline or post-process. Inline deduplication happens in real time: the data is deduplicated just prior to transmission or storage. Post-process deduplication is used at the storage level and requires the data to initially be stored in its full, non-deduplicated form. The advantage of post-process deduplication is that the work can be deferred until later, so it does not consume system resources during periods of peak user activity.
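The difference between the two timings can be sketched in a few lines. The InlineTarget and PostProcessTarget classes below are hypothetical and deliberately simplified. They only show when the fingerprinting work happens and how much physical space is used at each point.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class InlineTarget:
    """Deduplicates each chunk as it arrives, before it is stored."""

    def __init__(self):
        self.unique = {}  # fingerprint -> chunk

    def write(self, chunk: bytes) -> None:
        self.unique.setdefault(fingerprint(chunk), chunk)

    def physical_bytes(self) -> int:
        return sum(len(c) for c in self.unique.values())


class PostProcessTarget:
    """Stores chunks as-is, then deduplicates them in a later pass."""

    def __init__(self):
        self.landing = []  # raw chunks awaiting deduplication
        self.unique = {}   # fingerprint -> chunk

    def write(self, chunk: bytes) -> None:
        self.landing.append(chunk)  # no hashing during the backup itself

    def run_dedupe(self) -> None:
        for chunk in self.landing:
            self.unique.setdefault(fingerprint(chunk), chunk)
        self.landing.clear()

    def physical_bytes(self) -> int:
        raw = sum(len(c) for c in self.landing)
        deduped = sum(len(c) for c in self.unique.values())
        return raw + deduped


# Hypothetical stream: ten 1 KB chunks, only two distinct contents.
stream = [b"x" * 1024] * 8 + [b"y" * 1024] * 2

inline = InlineTarget()
post = PostProcessTarget()
for chunk in stream:
    inline.write(chunk)
    post.write(chunk)

inline_bytes = inline.physical_bytes()
before = post.physical_bytes()  # full size until the deferred pass runs
post.run_dedupe()
after = post.physical_bytes()
print(inline_bytes, before, after)
```

Both targets end up holding the same two unique chunks. The post-process target simply needs enough space for the full data set until its deferred pass has run.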

Possible deduplication disadvantages

In most cases, there is no real disadvantage to using deduplication. Some administrators who have been in IT for decades have expressed an aversion to it because it reminds them of a once-common form of data corruption called a cross-linked file, in which two files claim the same disk clusters. As such, they might question the reliability of the deduplication process.

A more common reason for reluctance is that the process can negatively impact performance in some situations. Consider source-side deduplication, for instance. If it is performed at the software level, it consumes memory and CPU resources and also results in additional disk I/O.

Although the overhead caused by backup deduplication in this situation is undeniable, it may be negligible. Assuming your hardware is adequate for its designated workload (as well as any activity spikes that may occur), there is a good chance it will be able to handle the overhead associated with deduplication without a noticeable drop in performance.

Also, the performance impact may be worth it. If source-side deduplication results in a 5% decrease in system performance, but allows data to be transmitted 50 times faster, would the performance impact be worth the increased throughput? Probably. If a 5% performance hit results in the system becoming noticeably sluggish, then the server is probably overloaded.
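A quick back-of-the-envelope calculation shows why the tradeoff usually favors deduplication. The backup size and link speed below are assumed values picked only for illustration, and "50 times faster" is modeled as sending one-fiftieth of the data over the same link.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: int,
                     reduction_ratio: float = 1.0) -> float:
    """Time to send a backup after shrinking it by reduction_ratio."""
    return size_bytes * 8 / reduction_ratio / link_bits_per_second


BACKUP_BYTES = 500 * 10**9  # assumed 500 GB backup
LINK_BPS = 100 * 10**6      # assumed 100 Mbps link
REDUCTION = 50.0            # the 50x figure used in the text

raw = transfer_seconds(BACKUP_BYTES, LINK_BPS)
deduped = transfer_seconds(BACKUP_BYTES, LINK_BPS, REDUCTION)
print(f"without dedupe: {raw / 3600:.1f} hours")
print(f"with dedupe:    {deduped / 60:.1f} minutes")
```

Under these assumptions, a transfer of roughly 11 hours drops to a little over 13 minutes, which is a large return on a 5% performance cost.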

One last reason some organizations might avoid data deduplication is that it is ineffective for some data types. Deduplication can only work if redundancy is present within the data. Data that is highly unique or that is already compressed (such as ZIP files or streaming media files) doesn't receive much benefit from deduplication.
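The effect of data type can be demonstrated with a small experiment. This sketch assumes fixed-size chunking and uses a deterministic stream of SHA-256 output as a stand-in for compressed data, since both look essentially random at the chunk level. The helper names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size


def dedupe_ratio(data: bytes, chunk_size: int = CHUNK_SIZE) -> float:
    """Logical bytes divided by bytes left after removing duplicate chunks."""
    unique = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return len(data) / sum(unique.values())


def high_entropy_bytes(n: int, seed: bytes = b"demo") -> bytes:
    """Deterministic stand-in for compressed or otherwise unique data."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


# 64 copies of the same 4 KB block versus 64 chunks with no repeats.
repetitive = b"same block of file data ".ljust(CHUNK_SIZE, b".") * 64
compressed_like = high_entropy_bytes(CHUNK_SIZE * 64)

ratio_repetitive = dedupe_ratio(repetitive)
ratio_compressed = dedupe_ratio(compressed_like)
print(f"repetitive data:      {ratio_repetitive:.0f}:1")
print(f"compressed-like data: {ratio_compressed:.0f}:1")
```

The repetitive data collapses to a single stored chunk, while the high-entropy data gains nothing because no two chunks match.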

Although there are reasons some organizations choose not to use deduplication, it is usually in an organization's best interest to employ it. Deduplication technology is mature enough that it is stable and reliable. Furthermore, deduplication can reduce storage costs while allowing bandwidth to be used more efficiently. The key to using backup deduplication effectively is to implement the deduplication method (or methods) best suited to your organization's unique needs.

About the author:
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server. Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the Department of Information Management at Fort Knox. You can visit Brien's personal website at
