Complete guide to backup deduplication
Deduplication has become an essential part of the backup process. However, organizations that routinely produce large quantities of data have discovered that deduplication, which is intended to optimize the backup process, can actually bog down backup systems.
To prevent this from happening, some backup vendors have begun offering products designed to make the deduplication process more efficient. This article discusses how these "deduplication accelerators" work.
Every backup vendor that offers deduplication acceleration capabilities has its own way of optimizing the deduplication process. For example, EMC offers a global deduplication feature that decreases the workload on the backup appliance by shifting much of the workload to the backup servers, thereby improving performance.
As a general rule, however, deduplication acceleration is based on the principle of workload distribution. Rather than allowing a single device to handle the full burden of the backup deduplication workload, the workload is distributed across multiple devices and can therefore be performed more quickly thanks to parallel processing. The result is faster deduplication, and the improvement can be attributed directly to more efficient use of CPU resources and network bandwidth.
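As a rough illustration of why distributing the work helps, the sketch below fingerprints a set of blocks with several workers instead of one. It is only an analogy: the threads stand in for separate devices, and the names `fingerprint` and `fingerprint_all`, the SHA-256 hash and the 4 KB block size are assumptions made for the example, not details of any vendor's product.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fingerprint(block: bytes) -> str:
    """Return a SHA-256 fingerprint for one block of backup data."""
    return hashlib.sha256(block).hexdigest()


def fingerprint_all(blocks, workers=4):
    """Fingerprint blocks using several workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fingerprint, blocks))


# Eight distinct 4 KB blocks, hashed one at a time and then in parallel.
blocks = [bytes([i]) * 4096 for i in range(8)]
serial = [fingerprint(b) for b in blocks]
parallel = fingerprint_all(blocks)
```

Both approaches produce the same fingerprints. The difference is only in who does the work, which is the point of workload distribution.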
To see how deduplication acceleration works, first consider how inline, target deduplication normally works. This type of architecture typically involves one or more backup servers sending data to a backup appliance. The appliance reviews each block of data that it receives. If the block is unique, it is written to backup storage. If the block is not unique, the appliance may verify that a copy of the block already exists within backup storage, and it updates a database entry to associate the existing block with the data that is currently being backed up.
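The appliance-side logic described above can be sketched in a few lines. This is a simplified model, not a real appliance: the `TargetAppliance` class, its `receive` and `restore` methods, the in-memory dictionaries and the use of SHA-256 are all assumptions chosen for illustration.

```python
import hashlib


class TargetAppliance:
    """Toy model of an appliance that deduplicates every block it receives."""

    def __init__(self):
        self.block_store = {}  # fingerprint -> block bytes (backup storage)
        self.catalog = {}      # backup name -> ordered list of fingerprints

    def receive(self, backup_name: str, block: bytes) -> bool:
        """Process one full block; return True if it was written as new data."""
        digest = hashlib.sha256(block).hexdigest()
        is_new = digest not in self.block_store
        if is_new:
            self.block_store[digest] = block
        elif self.block_store[digest] != block:
            # Optional verification that the stored copy really matches.
            raise ValueError("fingerprint collision detected")
        # Associate the block with the backup currently being written.
        self.catalog.setdefault(backup_name, []).append(digest)
        return is_new

    def restore(self, backup_name: str) -> bytes:
        """Rebuild a backup from its catalog entries."""
        return b"".join(self.block_store[d] for d in self.catalog[backup_name])


appliance = TargetAppliance()
data = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
results = [appliance.receive("monday", block) for block in data]
```

Note that every block, unique or not, still crosses the network and is hashed on the appliance. Only the amount of data written to storage is reduced.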
Although this approach works, it doesn't scale very well. As the volume of data that needs to be backed up increases, the backup appliance can be overwhelmed and the inline deduplication process can become a bottleneck. One way to work around this bottleneck and improve scalability is to use distributed deduplication.
In the previous example, the backup server blindly streamed data to the backup appliance, which then deduplicated and stored the data it received. Data streaming can be thought of as a one-way conversation. In an environment in which deduplication acceleration is being used, the one-way conversation is replaced by a two-way conversation. In other words, the backup server and the backup appliance actively communicate with one another in the interest of making the backup deduplication process much more efficient.
The exact method used varies from one vendor to the next, but the process generally involves having the backup server help determine whether data is unique before sending it to the backup appliance. Without acceleration, that determination is left entirely to the backup appliance.
To accomplish this, the backup server may hash a block of data that needs to be backed up. Rather than transmitting the entire data block to the backup appliance, the backup server transmits the hash. When the backup appliance receives the hash, it compares the hash against a hash table to determine whether the data is unique. If the data is found to be redundant, there is no need to transmit the block, because the appliance already holds a copy.
If, on the other hand, the data is unique, then the backup server must transmit the block to the backup appliance so that it can be backed up. Depending on the product being used, the backup server may try to compress the block prior to transmission. Doing so helps conserve network bandwidth, which is especially important if the backup is occurring across a slow link.
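The hash-first exchange described in the last two paragraphs might look like the following sketch. It assumes a hypothetical `Appliance` object with `has_block`, `add_reference` and `upload` calls, SHA-256 fingerprints and zlib compression. Real products use their own protocols, so treat every name here as illustrative.

```python
import hashlib
import zlib


class Appliance:
    """Hypothetical appliance exposing a hash lookup and a block upload call."""

    def __init__(self):
        self.block_store = {}  # fingerprint -> raw block
        self.catalog = {}      # backup name -> ordered fingerprints

    def has_block(self, digest: bytes) -> bool:
        return digest in self.block_store

    def add_reference(self, backup_name: str, digest: bytes) -> None:
        self.catalog.setdefault(backup_name, []).append(digest)

    def upload(self, backup_name: str, digest: bytes, payload: bytes) -> None:
        block = zlib.decompress(payload)
        if hashlib.sha256(block).digest() != digest:
            raise ValueError("payload does not match its fingerprint")
        self.block_store[digest] = block
        self.add_reference(backup_name, digest)


def run_backup(appliance: Appliance, backup_name: str, blocks) -> int:
    """Back up blocks and return the number of bytes placed on the wire."""
    bytes_sent = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        bytes_sent += len(digest)             # the 32-byte hash is always sent
        if appliance.has_block(digest):
            appliance.add_reference(backup_name, digest)   # redundant block
        else:
            payload = zlib.compress(block)    # unique block: compress, then send
            bytes_sent += len(payload)
            appliance.upload(backup_name, digest, payload)
    return bytes_sent


appliance = Appliance()
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
first_run = run_backup(appliance, "monday", blocks)
second_run = run_backup(appliance, "tuesday", blocks)
```

On the second run nothing is new, so only the three 32-byte hashes travel to the appliance. The sketch counts payload bytes only and ignores protocol overhead such as the appliance's replies.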
The nice thing about this approach to data deduplication is that it can greatly reduce WAN bandwidth use, while also improving backup speed. Again, the actual resource savings vary from one product to the next, but EMC claims to be able to deliver a 50% improvement in speed while also decreasing network bandwidth use by 80% to 99% with the company's Data Domain Boost product.
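To get a feel for where savings of this size could come from, the back-of-the-envelope model below estimates how much data crosses the network when only hashes plus unique blocks are sent. The block size, hash size and share of unique blocks are made-up inputs for illustration, not measurements from EMC or any other vendor, and the model ignores compression and protocol overhead.

```python
def wire_bytes(total_blocks: int, block_size: int,
               unique_fraction: float, hash_size: int = 32) -> int:
    """Bytes sent when every block's hash is sent, plus the unique blocks."""
    unique_blocks = round(total_blocks * unique_fraction)
    return total_blocks * hash_size + unique_blocks * block_size


def bandwidth_reduction(total_blocks: int, block_size: int,
                        unique_fraction: float, hash_size: int = 32) -> float:
    """Fractional saving compared with streaming every block in full."""
    baseline = total_blocks * block_size
    sent = wire_bytes(total_blocks, block_size, unique_fraction, hash_size)
    return 1 - sent / baseline


# Hypothetical backup: one million 8 KB blocks, 5% of which are new.
typical = bandwidth_reduction(1_000_000, 8192, 0.05)

# Best case for the same block size: nothing is new, only hashes are sent.
ceiling = bandwidth_reduction(1_000_000, 8192, 0.0)
```

With these assumed inputs the saving works out to about 94.6%, and the ceiling when no blocks are new is about 99.6%, because the hashes themselves still have to be sent.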
Any time a backup appliance performs inline data deduplication, there is a risk of the appliance becoming overwhelmed by the inbound data stream. It's especially problematic if multiple backup servers are sending data to a single backup appliance. Deduplication accelerators can help with this problem by shifting a portion of the deduplication workload to the backup server, thereby reducing resource usage and allowing the backup deduplication solution to scale to meet the required workload.
About the author:
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server. Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.