This is the fifth part of a nine-part series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.
Today, almost every backup vendor offers some form of source-based deduplication. Although target deduplication was arguably the more popular approach for a time, source deduplication has surged in recent years, driven largely by the popularity of centralized backups and backups to the cloud. Part 5 of our series takes a look at source-based deduplication today.
When cloud backups were first introduced, many cloud backup vendors used target deduplication as a way of decreasing their storage costs. The problem with this approach was that it failed to address one of the biggest disadvantages of cloud backups: the amount of bandwidth required to complete a cloud backup.
First-generation cloud backup solutions were especially painful because the backup process required an unrealistic amount of Internet bandwidth to operate efficiently. To add insult to injury, some cloud backup providers also throttled their customers' backups in an effort to consume less bandwidth, which made the backups run even more slowly than they otherwise would.
Today, most cloud backup providers realize that source-side deduplication is a must. Source-side deduplication performs the deduplication process at the customer's location, greatly reducing the volume of data that must be uploaded to the cloud. Many cloud providers still perform target deduplication as well, as a way of reducing storage requirements. At first, this might seem counterintuitive since the data has already been deduplicated at the source, but cloud backup providers back up data from multiple clients, which often results in duplicate data (such as operating system files) being stored.
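The core idea behind source-side deduplication can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor's actual implementation: the fixed 4 MB chunk size, the SHA-256 hashing and the `known_hashes` bookkeeping are all assumptions made for the sketch (real products often use variable-size chunking and more elaborate indexes). The client splits the backup data into chunks, hashes each one, and uploads only the chunks the backup target has not already stored:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks (illustrative choice)

def dedupe_upload(data: bytes, known_hashes: set[str]) -> tuple[list[bytes], list[str]]:
    """Split data into chunks and return (chunks_to_upload, manifest).

    Only chunks whose hashes are not already in known_hashes need to
    cross the wire; the manifest of chunk hashes lets the target
    reassemble the original data from chunks it already holds.
    """
    to_upload = []
    manifest = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        manifest.append(digest)
        if digest not in known_hashes:
            to_upload.append(chunk)
            known_hashes.add(digest)
    return to_upload, manifest
```

Because the hash lookup happens before anything is transmitted, a file made up of repeated or previously backed-up blocks costs almost no bandwidth on subsequent backups, which is exactly the property that makes cloud backup practical over ordinary Internet links.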
Source-based deduplication is also heavily used in organizations that perform centralized backups. Large organizations, for example, will often perform backups at a central location rather than making backups locally in each branch office. Centralized backups reduce complexity and also reduce the cost of backup hardware.
Like cloud backups, however, centralized backups require the data being backed up to be transmitted across a slow link (typically the Internet or a leased line). Source deduplication reduces the volume of data that must be transmitted, thereby decreasing the amount of bandwidth consumed. This approach can have the added benefit of decreasing costs if the backup occurs across a metered connection.
One of the main disadvantages of source deduplication is that the deduplication process consumes CPU, memory and disk I/O resources on the machine being backed up. For that reason, organizations that perform centralized backups of branch offices often rely on post-process deduplication, in which the deduplication process is scheduled to occur at a time when it will have minimal impact on production workloads.
Source deduplication has become so common that it is little more than a checklist item for administrators comparing backup features. Even so, some vendors have taken innovative approaches to the deduplication process. Veeam Backup & Replication, for example, performs deduplication at the backup job level, and the software is designed to complement the native file system deduplication capabilities found in Windows Server 2012.
Symantec has also taken an innovative approach to deduplication with a technology called V-Ray. V-Ray looks inside of the data stream to identify the types of data that are being backed up. This approach allows the software to use the most efficient deduplication algorithm for a given data type rather than using a one-size-fits-all approach to deduplication.
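Symantec has not published V-Ray's internals, but the general idea of content-aware deduplication can be illustrated with a toy sketch. In this hypothetical example, leading "magic" bytes are used to guess the type of an incoming stream so a strategy suited to that data type can be chosen; the signature table and strategy names are illustrative assumptions, not Symantec's actual logic:

```python
# Hypothetical signature table mapping leading "magic" bytes to a guessed
# data type and a deduplication strategy suited to it. Real products use
# far more sophisticated, proprietary content inspection.
SIGNATURES = {
    b"PK\x03\x04": ("zip-archive", "file-level"),        # compressed data rarely dedupes block by block
    b"KDMV": ("vmdk-disk", "fixed-block"),               # VM disks tend to align to fixed blocks
    b"SQLite format 3": ("sqlite-db", "variable-block"),
}

def choose_strategy(header: bytes) -> tuple[str, str]:
    """Inspect the first bytes of a data stream and pick a dedupe strategy."""
    for magic, (data_type, strategy) in SIGNATURES.items():
        if header.startswith(magic):
            return data_type, strategy
    return "unknown", "variable-block"  # general-purpose fallback
```

The payoff of this kind of type-aware dispatch is that each class of data gets the algorithm that deduplicates it best, rather than forcing one algorithm to handle virtual machine images, databases and file shares equally.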
Part 6 of this series compares source and target deduplication and explains how they are typically used to complement each other.
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.
This was first published in April 2013