This is the seventh part of a nine-part series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.
Imagine, for example, that an organization has a disk-based backup system and needs to create deduplicated backups of five file servers. There are a number of different ways that such a backup could be accomplished. One option might be to perform source deduplication of each server, so that the data is deduplicated prior to being written to the backup target.
One problem with this approach is that deduplication occurs on a per-file-server basis. The data on each file server is deduplicated, but two or more servers are likely to hold copies of the same data. With this type of backup, cross-server duplicates are not removed during deduplication. Hence, the backup target could end up storing duplicate data even though each server's data has been deduplicated individually.
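The gap can be illustrated with a minimal Python sketch. It is not taken from any backup product: the server names, the sample chunks and the use of SHA-256 fingerprints are all assumptions made for illustration. Each server deduplicates its own data, and the chunk counts are then compared with what a second, cross-server pass would store.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by its SHA-256 digest (an assumed fingerprint scheme)."""
    return hashlib.sha256(chunk).hexdigest()


def dedupe(chunks):
    """Return a fingerprint -> chunk map with duplicate chunks removed."""
    store = {}
    for chunk in chunks:
        store.setdefault(fingerprint(chunk), chunk)
    return store


# Hypothetical file servers; every one of them holds the same OS image.
servers = {
    "fs1": [b"os-image", b"report-a", b"report-a"],
    "fs2": [b"os-image", b"report-b"],
    "fs3": [b"os-image", b"report-a"],
}

# Source deduplication only: each server removes its own duplicates.
per_server = {name: dedupe(chunks) for name, chunks in servers.items()}
stored_after_source_only = sum(len(store) for store in per_server.values())

# A second pass across all servers removes the cross-server duplicates.
global_store = {}
for store in per_server.values():
    global_store.update(store)
stored_after_cross_server_pass = len(global_store)

print(stored_after_source_only, stored_after_cross_server_pass)
```

In this toy data set, source deduplication alone leaves six chunks on the backup target, while the cross-server pass reduces that to three.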
One especially popular solution to this problem is source + target deduplication. In this situation, the data would be deduplicated on the individual file servers, but there would also be an inline deduplication process that runs on the backup target as a way of making sure that no redundant data is stored in the backups.
While source + target deduplication might sound promising, it has one major disadvantage: it tends not to scale very well. Often, the sheer volume of data that needs to be backed up causes the backup target’s controller to become a bottleneck.
Global deduplication is similar to source + target deduplication in that data is deduplicated at the source and again at the backup target. However, global deduplication solutions attempt to eliminate bottlenecks through load balancing.
A backup target that is designed for global deduplication typically presents itself to the backup servers as a single pool of storage. When the backup process begins, however, the inbound data is dynamically load-balanced across multiple controllers. These controllers help divide the workload, thus allowing more data to be deduplicated than would be possible using a single controller.
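One way to picture a single pool backed by several controllers is the Python sketch below. It is only an illustration under stated assumptions: the `Controller` and `GlobalDedupPool` classes are hypothetical, and the routing rule (chunk fingerprint modulo the number of controllers) is a simple static stand-in for the dynamic load balancing described above, not how any vendor's product is known to work. The point it shows is that identical chunks always reach the same controller, so duplicates are still caught even though the work is divided.

```python
import hashlib


class Controller:
    """Hypothetical controller that keeps its own fingerprint index."""

    def __init__(self, name):
        self.name = name
        self.index = {}      # fingerprint -> chunk
        self.received = 0    # chunks this controller had to examine

    def ingest(self, fp, chunk):
        """Store the chunk if it is new; return True when it was stored."""
        self.received += 1
        if fp in self.index:
            return False
        self.index[fp] = chunk
        return True


class GlobalDedupPool:
    """Presents several controllers to the backup server as one pool."""

    def __init__(self, controller_count):
        self.controllers = [
            Controller(f"ctrl-{i}") for i in range(controller_count)
        ]

    def route(self, fp):
        # Assumed routing rule: the fingerprint decides the controller.
        return self.controllers[int(fp, 16) % len(self.controllers)]

    def write(self, chunk):
        fp = hashlib.sha256(chunk).hexdigest()
        return self.route(fp).ingest(fp, chunk)

    def stored_chunks(self):
        return sum(len(c.index) for c in self.controllers)


pool = GlobalDedupPool(controller_count=2)
stream = [b"os-image", b"report-a", b"os-image", b"report-b", b"report-a"]
results = [pool.write(chunk) for chunk in stream]

print(results)               # True means the chunk was new and was stored
print(pool.stored_chunks())  # unique chunks held across all controllers
```

The backup server only ever calls `write` on the pool; which controller does the work is hidden from it, which mirrors the single-pool presentation described above.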
Examples of global deduplication
Because global deduplication solutions are specifically designed to address the scalability shortcomings of more traditional deduplication solutions, global deduplication tends to be implemented primarily in large data centers. Global deduplication products are mainly available from vendors that offer enterprise-class storage products.
One of the leaders in global deduplication is EMC Corp., which offers a product called the EMC Data Domain Global Deduplication Array. The EMC Global Deduplication Array load-balances the deduplication process across two high-end Data Domain controllers.
The EMC Global Deduplication Array is designed specifically to work with the EMC Data Domain Boost software and EMC NetWorker, Symantec NetBackup or Symantec Backup Exec. However, organizations that use a different backup application can use the Global Deduplication Array by connecting it to the existing backup infrastructure as a virtual tape library.
CommVault Systems Inc. also offers global deduplication, but its approach differs from EMC's. While EMC's global deduplication is based on a hardware appliance, CommVault implements global deduplication in software.
In CommVault's Simpana 9, the deduplication process begins at the data source by leveraging the backup client software. This first step reduces the amount of data that must be transferred across the network. CommVault also uses media agents as secondary deduplication points. By using a software-only approach to deduplication, CommVault gives organizations the flexibility to use the type of backup storage that makes the most sense for them (DAS, NAS, SAN, etc.), as opposed to being locked into a hardware appliance.
Another major player in the global deduplication space is Symantec Corp. Like EMC, Symantec uses a hardware appliance in its global deduplication strategy. Symantec's global deduplication solution is based around the NetBackup 5000 series appliance and uses both source-side and target deduplication.
Source deduplication is performed by the Symantec NetBackup client, while the NetBackup appliance handles target deduplication.
Data can be replicated from the NetBackup 5000 to another NetBackup appliance as a way of protecting the storage pool's contents. Deduplication is also used in the replication process, ensuring that any data that already exists on the replica is not needlessly transmitted.
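The idea of deduplicated replication can be sketched in a few lines of Python. This is a generic illustration, not Symantec's implementation: the `replicate` function, the in-memory dictionaries standing in for the two appliances and the SHA-256 fingerprints are all assumptions. Only chunks whose fingerprints are missing from the replica are sent.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by its SHA-256 digest (an assumed fingerprint scheme)."""
    return hashlib.sha256(chunk).hexdigest()


def replicate(source_store, replica_store):
    """Copy to the replica only the chunks it lacks; return how many were sent."""
    missing = source_store.keys() - replica_store.keys()
    for fp in missing:
        replica_store[fp] = source_store[fp]
    return len(missing)


# Hypothetical storage pools: the replica already holds the OS image.
source = {fingerprint(c): c for c in [b"os-image", b"report-a", b"report-b"]}
replica = {fingerprint(c): c for c in [b"os-image"]}

sent_first = replicate(source, replica)   # only the two missing chunks travel
sent_second = replicate(source, replica)  # nothing new, so nothing is sent

print(sent_first, sent_second)
```

In a real deployment the two sides would exchange fingerprints over the network before any chunk data moves; the dictionary comparison here simply stands in for that exchange.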
Learn who should consider global deduplication in part eight of our series.
About the Author:
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.