Deduplication 2013: Source deduplication

In the fifth part of our series on deduplication in 2013, independent backup expert Brien Posey discusses source deduplication.

This is the fifth part of a nine-part series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.

Today, almost every backup vendor offers some form of source-based deduplication. Although there was a period of time when target deduplication arguably became more popular than source deduplication, source deduplication has surged in popularity in recent years. This can be directly attributed to factors such as the popularity of centralized backups and backups to the cloud. Part 5 of our series takes a look at source-based deduplication today.

When cloud backups were first introduced, many of the cloud backup vendors used target deduplication as a way of decreasing their storage costs. The problem with using this approach was that it failed to address one of the biggest disadvantages to cloud backups -- the amount of bandwidth required to complete a cloud backup.

First-generation cloud backup solutions were especially painful because the backup process required an unrealistic amount of Internet bandwidth in order to operate efficiently. To add insult to injury, some of the cloud backup providers also throttled their customers' backups in an effort to consume less bandwidth. However, this approach made the backups run even more slowly than they otherwise would.

Today, most of the cloud backup providers realize that source-side deduplication is a must. Source-side deduplication performs the deduplication process at the customer's location, greatly reducing the volume of data that must be uploaded to the cloud. Of course, many cloud providers still also perform target deduplication as a way of reducing storage requirements. At first, this approach might seem counterintuitive since data has already been deduplicated at the source, but cloud backup providers back up data from multiple clients, and this often results in duplicate data (such as operating system files) being stored.
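To make the two stages concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's implementation: the class names `SourceClient` and `CloudTarget`, the 4-byte chunk size and the sample data are all hypothetical, and real products typically use much larger (often variable-size) chunks. Each client fingerprints its chunks with SHA-256 and uploads only the chunks it has not sent before, while the provider keeps a single copy of any chunk that arrives from more than one customer.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose, for illustration only


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class SourceClient:
    """Hypothetical client that deduplicates before anything is uploaded."""

    def __init__(self):
        self.sent = set()  # fingerprints this client has already uploaded

    def backup(self, data: bytes):
        """Return the ordered fingerprints (recipe) and only the new chunks."""
        recipe, upload = [], {}
        for chunk in split_chunks(data):
            fp = hashlib.sha256(chunk).hexdigest()
            recipe.append(fp)
            if fp not in self.sent:
                self.sent.add(fp)
                upload[fp] = chunk
        return recipe, upload


class CloudTarget:
    """Hypothetical provider that deduplicates again across all customers."""

    def __init__(self):
        self.store = {}          # fingerprint -> chunk, one copy per fingerprint
        self.bytes_received = 0  # what actually crossed the wire

    def receive(self, upload):
        for fp, chunk in upload.items():
            self.bytes_received += len(chunk)
            self.store.setdefault(fp, chunk)  # keep one copy across clients

    def restore(self, recipe):
        return b"".join(self.store[fp] for fp in recipe)


# Two customers whose backups share "operating system" chunks OSF1 and OSF2.
client_a, client_b, cloud = SourceClient(), SourceClient(), CloudTarget()
data_a = b"OSF1OSF2OSF1DOCA"
data_b = b"OSF1OSF2DOCB"
recipe_a, upload_a = client_a.backup(data_a)
recipe_b, upload_b = client_b.backup(data_b)
cloud.receive(upload_a)
cloud.receive(upload_b)

raw_bytes = len(data_a) + len(data_b)
stored_bytes = sum(len(c) for c in cloud.store.values())
print(raw_bytes, cloud.bytes_received, stored_bytes)
```

In this toy run the source stage trims what each client uploads, and the target stage trims what the provider stores, because the shared chunks from the second customer are already present.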

Source-based deduplication is also heavily used in organizations that perform centralized backups. Oftentimes, for example, large organizations will perform backups at a central location rather than making backups locally in each branch office. Centralized backups reduce complexity and also reduce the cost of backup hardware.

Like cloud backups, however, centralized backups require the data that is being backed up to be transmitted across a slow link (typically, the Internet or a leased line). Source deduplication reduces the volume of the data that must be transmitted, thereby decreasing the amount of bandwidth that is consumed. This approach can have the added benefit of decreasing costs if the backup is occurring across a metered connection.
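The effect on the backup window is simple arithmetic, sketched below in Python. The figures (a 500 GB backup, a 20 Mbps link and a 10:1 reduction ratio) are hypothetical and chosen only for illustration; actual deduplication ratios vary widely with the data. The function name `transfer_hours` is likewise an assumption, and the calculation ignores protocol overhead.

```python
def transfer_hours(data_gb: float, link_mbps: float, dedup_ratio: float = 1.0) -> float:
    """Estimate hours needed to send a backup over a WAN link.

    Uses decimal units (1 GB = 8,000 megabits) and ignores protocol overhead.
    dedup_ratio is the reduction factor, e.g. 10.0 for a 10:1 ratio.
    """
    megabits_to_send = data_gb * 8000 / dedup_ratio
    return megabits_to_send / link_mbps / 3600


# Hypothetical branch office: 500 GB of backup data over a 20 Mbps link.
without_dedup = transfer_hours(500, 20)
with_dedup = transfer_hours(500, 20, dedup_ratio=10.0)
print(f"{without_dedup:.1f} hours without, {with_dedup:.1f} hours with")
```

On a metered connection the same ratio applies to the bill, since the charge tracks the volume actually sent.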

One of the main disadvantages to source deduplication is the fact that the deduplication process consumes CPU, memory and disk I/O resources. That being the case, organizations that perform centralized backups of branch offices often rely on post-process deduplication in which the deduplication process is scheduled to occur at a time when it will have minimal impact on production workloads.
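The difference between doing the work during the backup and doing it later can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any product: `PostProcessStore` and its methods are hypothetical names, and each block is treated as a single chunk. During the backup window the store only appends raw blocks, so no hashing competes with production workloads; the fingerprinting runs in a separate pass that a scheduler would trigger off-hours.

```python
import hashlib


class PostProcessStore:
    """Hypothetical store that defers deduplication to a scheduled pass."""

    def __init__(self):
        self.staging = []  # raw blocks written during the backup window
        self.chunks = {}   # fingerprint -> block, filled by the later pass
        self.recipe = []   # ordered fingerprints needed to restore

    def ingest(self, block: bytes):
        """Backup window: append the raw block, with no hashing or lookups."""
        self.staging.append(block)

    def dedupe_pass(self):
        """Off-hours: fingerprint staged blocks and keep one copy of each."""
        for block in self.staging:
            fp = hashlib.sha256(block).hexdigest()
            self.chunks.setdefault(fp, block)
            self.recipe.append(fp)
        self.staging.clear()

    def restore(self) -> bytes:
        return b"".join(self.chunks[fp] for fp in self.recipe)


store = PostProcessStore()
for block in [b"mail", b"logs", b"mail", b"mail"]:
    store.ingest(block)

staged_bytes = sum(len(b) for b in store.staging)  # space used before the pass
store.dedupe_pass()
stored_bytes = sum(len(b) for b in store.chunks.values())  # space used after
print(staged_bytes, stored_bytes)
```

The trade-off is visible in the staging area: the data sits undeduplicated until the pass runs, so the savings arrive later in exchange for a lighter load during production hours.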

Source deduplication has become so common that it is little more than a checklist item for administrators comparing backup features. Even so, some vendors have taken innovative approaches to the deduplication process. Veeam Backup and Replication, for example, performs deduplication at the backup job level, and the software is designed to complement the native file system deduplication capabilities found in Windows Server 2012.

Symantec has also taken an innovative approach to deduplication with a technology called V-Ray. V-Ray looks inside of the data stream to identify the types of data that are being backed up. This approach allows the software to use the most efficient deduplication algorithm for a given data type rather than using a one-size-fits-all approach to deduplication.

Part 6 of this series compares source and target deduplication and explains how they are typically used to complement each other.

About the Author:
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website

This was last published in April 2013


Join the conversation


Nice article. I agree with the point that source-level deduplication is becoming just another item on the checklist.

I think the challenging part of cloud backup, from the point of view of enterprise CIOs, is where the encryption is performed. This point is definitely going to drive cloud adoption for enterprise backup.

Doing deduplication at the source makes it difficult to perform encryption at the source (it is well known that the two don't go well together), except by using either a common key or convergent encryption, both of which have known security issues.

Vaultize seems to be the only available solution that handles encryption and deduplication both at source.

Hope to see some thoughts around this in this series.

Brien, does this also mean that Exchange Server 2013 SP1 can be fully supported for archive mailbox deduplication on an EMC Data Domain CIFS share (using a UNC path)?