This is the sixth part of a nine-part series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.
Source and target deduplication each have advantages and disadvantages. Part six of our series on deduplication in 2013 compares the two technologies and discusses how they are used to complement each other.
The primary disadvantage of source-based deduplication is that the deduplication process consumes hardware resources such as CPU cycles, memory and disk I/O on production servers. This can be especially problematic in virtual data centers, where a collection of virtual machines shares a physical server's hardware resources. The host server's resources are often spread thin, so an increased workload can have a negative impact on host server performance.
Of course, this isn't to say that source deduplication is always disruptive. If a server has adequate hardware resources to accommodate the deduplication process, then source deduplication can prove beneficial. It is also possible to reduce the impact of source deduplication by using post-process deduplication.
Windows Server 2012's native file system deduplication is a particularly good example of a post-process, source deduplication mechanism. There are two main reasons why Microsoft chose to use post-process deduplication.
The first reason is that post-process deduplication helps to avoid resource contention. The deduplication process can be scheduled to occur during off-peak hours when there is less of a load on the server.
The other reason for using post-process deduplication is that it makes it possible to control what gets deduplicated. For example, Microsoft has designed Windows Server 2012 to not deduplicate data until it has existed for a few days (the actual threshold is customizable). This prevents the deduplication process from needlessly consuming system resources by deduplicating things like temporary files.
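The age threshold described above can be illustrated with a minimal Python sketch. This is not how Windows Server 2012 implements the feature; the function names, the use of last-modified time as the measure of age and the three-day default are all assumptions made for illustration.

```python
import os
import time

SECONDS_PER_DAY = 86400


def is_old_enough(mtime, now, min_age_days=3):
    """True if a file last modified at `mtime` is at least `min_age_days` old.

    The 3-day default is an illustrative assumption, not a vendor setting.
    """
    return (now - mtime) >= min_age_days * SECONDS_PER_DAY


def files_eligible_for_dedup(root, min_age_days=3, now=None):
    """Walk `root` and return the files that have aged past the threshold."""
    now = time.time() if now is None else now
    eligible = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            if is_old_enough(mtime, now, min_age_days):
                eligible.append(path)
    return sorted(eligible)
```

A scheduled off-peak job could call files_eligible_for_dedup on a volume and hand only the returned paths to the deduplication step, so short-lived temporary files are never processed.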
Target deduplication offloads the deduplication process from production servers to the secondary storage device (typically SAN hardware or a dedicated appliance). This prevents the production servers from having to perform the deduplication. However, there are a couple of drawbacks to target deduplication as well.
Perhaps the biggest drawback is that the target device can become overwhelmed. If a single target is forced to perform deduplication for multiple servers, then there is always the possibility that the target might not be able to keep pace with the demand. However, this situation is avoidable with proper capacity planning.
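As a rough illustration of the capacity planning mentioned above, the following Python sketch compares the ingest rate that a set of backup jobs demands with the rate a target can sustain. The function names, the 80% headroom factor and all of the figures are hypothetical; real sizing should rely on the vendor's published and measured throughput.

```python
def required_ingest_mb_per_s(backup_sizes_gb, window_hours):
    """Average ingest rate (MB/s) needed to finish all backups in the window.

    Uses decimal units: 1 GB = 1000 MB.
    """
    total_mb = sum(backup_sizes_gb) * 1000
    return total_mb / (window_hours * 3600)


def target_can_keep_pace(backup_sizes_gb, window_hours,
                         target_ingest_mb_per_s, headroom=0.8):
    """True if the target, derated by `headroom`, covers the required rate.

    The 0.8 headroom factor is an assumed safety margin, not a standard.
    """
    needed = required_ingest_mb_per_s(backup_sizes_gb, window_hours)
    return needed <= target_ingest_mb_per_s * headroom


# Three servers sending 500 GB, 300 GB and 200 GB in an 8-hour window.
sizes = [500, 300, 200]
needed = required_ingest_mb_per_s(sizes, 8)      # about 34.7 MB/s
fits_50 = target_can_keep_pace(sizes, 8, 50)     # True: 40 MB/s usable
fits_40 = target_can_keep_pace(sizes, 8, 40)     # False: 32 MB/s usable
```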
Another disadvantage of target deduplication is that data is sent to the target in its full, unreduced form, which means that the transmission consumes more bandwidth than it would if the data had been deduplicated at the source. This isn't usually much of a problem on a local network, but it can be an issue if the target exists across a WAN link.
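The bandwidth cost is easy to estimate. The Python sketch below compares the time needed to send a backup across a link with and without reduction at the source. The function name, the 20 Mbit/s link and the 10:1 reduction ratio are illustrative assumptions, and the calculation ignores protocol overhead and the traffic needed to exchange chunk fingerprints.

```python
def transfer_hours(data_gb, link_mbit_per_s, reduction_ratio=1.0):
    """Hours needed to send `data_gb` after reducing it by `reduction_ratio`.

    Uses decimal units: 1 GB = 8000 megabits.
    """
    sent_megabits = (data_gb / reduction_ratio) * 8000
    return sent_megabits / link_mbit_per_s / 3600


# 100 GB backup over an assumed 20 Mbit/s WAN link.
full = transfer_hours(100, 20)            # target dedup: all data crosses the link
reduced = transfer_hours(100, 20, 10.0)   # source dedup at an assumed 10:1 ratio
```

With these assumed numbers the full transfer takes roughly 11 hours, while the reduced transfer takes a little over one hour.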
Source-based deduplication can be problematic in virtual data centers because virtual machines share memory, disk I/O and CPU, but target deduplication can be problematic there as well. The virtual machines residing on a host server also share network and/or Fibre Channel connectivity, and server chassis often have a very limited number of slots for network adapters or Fibre Channel host bus adapters. As a result, some virtualization hosts have plenty of available memory, disk and CPU resources but are short on network resources, which target deduplication taxes by sending full, unreduced data to the target. Administrators should carefully consider the available hardware resources before committing to either source or target deduplication.
Using hardware and software deduplication together
Though source and target deduplication each have advantages, neither deduplication technology is suitable for every situation. As such, some vendors have begun offering both source and target deduplication. EMC provides such capabilities through products such as Data Domain and Avamar, but it is far from being the only vendor to offer source and target deduplication capabilities.
Symantec's NetBackup 5000 Appliance Series, for example, is a backup appliance that offers both source and target deduplication capabilities. Target-side deduplication occurs at the hardware level, while source deduplication can be enabled on protected resources through the backup agents.
Similarly, Quantum has begun offering both source- and target-based deduplication through its DXi Accent software. Like EMC's Data Domain Boost (DD Boost), DXi Accent is designed to distribute the deduplication process so that no single device is forced to carry the full burden. Deduplication can occur on the backup servers and/or on DXi appliances such as Quantum's DXi8500.
There are a number of different situations in which it may be advantageous to use a combination of source and target deduplication.
Source and target deduplication are most commonly combined when an organization wants to centralize its backups. In this type of environment, source-side deduplication would most likely be used in the branch offices as a way to reduce bandwidth consumption across the WAN links. Servers within the primary data center might instead use target deduplication, since bandwidth consumption is less of a concern for servers that are being backed up locally.
It is also common to combine source and target deduplication in multi-tenant environments (such as a cloud backup provider). In these environments, the clients are almost always separated from the backup target by a slow link (typically the Internet). Using source deduplication helps the organization to back up client data efficiently. Although this approach removes the redundancy from the individual backups, the backups might collectively contain some redundancy. Target deduplication can then be used to eliminate any cross-tenant redundancy.
A very similar approach can be used to deduplicate backups of multiple servers. Suppose, for instance, that a separate backup job were used for each of an organization's servers. Source-based deduplication could reduce the amount of data that has to be backed up within each backup job. Target deduplication could then be used to search the various backups for any cross-job redundancy that can be eliminated.
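The two-stage approach described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any vendor's implementation: it uses fixed-size chunks and SHA-256 fingerprints, and the names CHUNK_SIZE, source_dedup and Target are hypothetical. Commercial products typically use variable-size chunking and far larger chunks.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, chosen only to keep the example readable


def source_dedup(data, size=CHUNK_SIZE):
    """Stage 1, run per backup job: drop chunks repeated within the job.

    Returns (recipe, unique): the ordered list of chunk fingerprints and a
    dict holding one copy of each distinct chunk.
    """
    recipe, unique = [], {}
    for i in range(0, len(data), size):
        piece = data[i:i + size]
        digest = hashlib.sha256(piece).hexdigest()
        recipe.append(digest)
        unique.setdefault(digest, piece)
    return recipe, unique


class Target:
    """Stage 2: a shared store that drops chunks repeated across jobs."""

    def __init__(self):
        self.store = {}    # fingerprint -> chunk, shared by every job
        self.recipes = {}  # job name -> ordered fingerprints

    def ingest(self, job_name, recipe, unique):
        """Store only chunks not already present; return how many were new."""
        new = 0
        for digest, piece in unique.items():
            if digest not in self.store:
                self.store[digest] = piece
                new += 1
        self.recipes[job_name] = recipe
        return new

    def restore(self, job_name):
        """Rebuild a job's original data from its recipe."""
        return b"".join(self.store[d] for d in self.recipes[job_name])


job_a = b"AAAABBBBAAAACCCC"   # repeats AAAA within the job
job_b = b"BBBBCCCCDDDD"       # shares BBBB and CCCC with job_a

target = Target()
new_a = target.ingest("server-a", *source_dedup(job_a))  # 3 new chunks
new_b = target.ingest("server-b", *source_dedup(job_b))  # 1 new chunk (DDDD)
```

In this example the source stage removes the repeated AAAA chunk before job_a is sent, and the target stage then stores only DDDD from job_b because BBBB and CCCC already arrived with job_a.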
Learn more about global deduplication in Part 7 of this series on deduplication.
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.