This is the ninth and final part of our series on deduplication. For easy access to all nine parts, check out our quick overview of Deduplication 2013.
The earlier chapters of this series have discussed various methods and technologies that are used to deduplicate data. As you have seen, there are advantages and disadvantages to each of these methods. However, a discussion of deduplication would not be complete without addressing a number of additional important considerations.
The deduplication tax
In some cases, restoring deduplicated data takes longer than restoring the same data would have taken had it never been deduplicated. This penalty is commonly called the deduplication tax. The degree to which it becomes an issue depends largely on the nature of the restoration, which vendor's product is being used and the method used to deduplicate the data.
Imagine for a moment that you are using delta differential deduplication for your backups. This method checks to see if data already exists on the backup target, so as to avoid writing data that already exists. This means that as time goes on, less and less data should theoretically be written to the backup, because the deduplication software is able to make use of existing data. As such, the likelihood that the backup software will write an entire file to the backup disks in a contiguous manner also decreases. Instead, the backup software only writes the blocks that are missing.
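To make the idea concrete, here is a minimal sketch of a backup target that writes only the blocks it does not already hold. It uses generic hash-based block matching as a stand-in, not any particular vendor's delta differential method. The BlockStore class, the four-byte block size and the file names are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow


class BlockStore:
    """Hypothetical backup target that stores each unique block only once."""

    def __init__(self):
        self.blocks = {}   # block hash -> block bytes
        self.catalog = {}  # backup name -> ordered list of block hashes

    def backup(self, name, data):
        """Back up data and return the number of blocks actually written."""
        written = 0
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # Only blocks missing from the target are written.
                self.blocks[digest] = block
                written += 1
            recipe.append(digest)
        self.catalog[name] = recipe
        return written

    def restore(self, name):
        """Rehydrate a backup by reassembling its blocks in order."""
        return b"".join(self.blocks[d] for d in self.catalog[name])


store = BlockStore()
first = store.backup("monday", b"AAAABBBBCCCC")   # every block is new
second = store.backup("tuesday", b"AAAABBBBDDDD")  # only the last block is new
print(first, second)
```

In this sketch the second backup writes a single block, because the other two already exist on the target. That is the space saving described above, and it is also why the second backup's data is no longer stored in one contiguous run.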
While the deduplication process might make data backups more efficient, it can actually slow down the restoration process because very little data is stored contiguously on the disk. Instead, the disk subsystem has to deal with extreme fragmentation. The backup software must seek out blocks that are scattered all across the disk, in an effort to rehydrate the data. Disk defragmentation cannot help with this problem, because a single block is likely to be shared by many different files.
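The fragmentation effect can be illustrated with a rough model in which the backup disk is an append-only list of blocks and a "seek" is any read that does not directly follow the previous block. This is a simplified sketch under those assumptions. The function names and sample data are hypothetical, and real products lay out data in far more sophisticated ways.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow


def backup(log, index, data):
    """Append unseen blocks to the log and return the backup's block positions."""
    positions = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        if digest not in index:
            index[digest] = len(log)  # new block lands at the end of the disk
            log.append(block)
        positions.append(index[digest])
    return positions


def count_seeks(positions):
    """Count reads that do not directly follow the previously read block."""
    return sum(1 for prev, cur in zip(positions, positions[1:]) if cur != prev + 1)


log, index = [], {}
day1 = backup(log, index, b"AAAABBBBCCCCDDDD")
day2 = backup(log, index, b"AAAAXXXXCCCCDDDD")
day3 = backup(log, index, b"YYYYXXXXCCCCZZZZ")

for name, positions in (("day1", day1), ("day2", day2), ("day3", day3)):
    print(name, positions, "seeks:", count_seeks(positions))
```

Under this model the first backup is read back sequentially, while each later backup needs more jumps across the disk to restore, even though every backup is the same size.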
Keep in mind that the deduplication tax is not unique to delta differential deduplication. Every form of deduplication incurs some form of penalty with regard to rehydrating deduplicated data.
Backup vendors have come up with a variety of clever methods to minimize the deduplication tax. For example, some of the backup appliance vendors have worked to increase the number of disks that the appliance can accommodate. This not only increases the appliance's capacity, but can also increase performance, because data can be striped across multiple disks. Performance can be further improved by making use of solid-state disks.
Caching is another technique that is sometimes used to reduce the deduplication tax, and it grew out of an effort to solve a different problem. First-generation storage deduplication products were plagued by the fact that if a single storage block became corrupted, the corruption could potentially impact hundreds or even thousands of files. To reduce the chances of this happening, some vendors began keeping track of how frequently each block was used. A disk was dedicated to storing a secondary copy of the most frequently used blocks (some vendors stored the backup copy of popular blocks on the same disk as the primary copy rather than using a dedicated disk). That way, if the primary copy of a commonly used storage block became corrupted, there was a spare copy to fall back on. Not long after this technique was introduced, vendors realized that the most popular storage blocks could also be cached in memory as a way of reducing read time during restore operations.
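The bookkeeping behind this approach can be sketched in a few lines: count how many files reference each block, then keep the most referenced blocks in memory so that a restore touches the disk less often. The names below (hottest_blocks, CachedReader, the block IDs and file names) are hypothetical, and the dictionary standing in for the disk is an assumption made to keep the example self-contained.

```python
from collections import Counter


def hottest_blocks(catalog, top_n):
    """Return the top_n block IDs with the most references across all files."""
    refs = Counter()
    for recipe in catalog.values():
        refs.update(recipe)
    return [block_id for block_id, _ in refs.most_common(top_n)]


class CachedReader:
    """Hypothetical reader that serves popular blocks from memory."""

    def __init__(self, disk, hot_ids):
        self.disk = disk
        self.cache = {block_id: disk[block_id] for block_id in hot_ids}
        self.disk_reads = 0

    def read(self, block_id):
        if block_id in self.cache:
            return self.cache[block_id]  # no disk access needed
        self.disk_reads += 1
        return self.disk[block_id]

    def restore(self, recipe):
        return b"".join(self.read(block_id) for block_id in recipe)


# A plain dict stands in for the disk in this sketch.
disk = {"b1": b"HEAD", "b2": b"BODY", "b3": b"TAIL", "b4": b"MISC"}
catalog = {
    "a.doc": ["b1", "b2", "b3"],
    "b.doc": ["b1", "b2", "b4"],
    "c.doc": ["b1", "b2"],
}

hot = hottest_blocks(catalog, 2)
reader = CachedReader(disk, hot)
data = reader.restore(catalog["a.doc"])
print(sorted(hot), data, reader.disk_reads)
```

The same reference counts could just as easily be used to decide which blocks deserve a spare copy on disk, which is the original purpose described above.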
Some products periodically take full backups, which can be used for a rapid restore to the point in time at which the backup was made. Another technique that is sometimes used is called forward-referencing deduplication. This type of deduplication also uses periodic full backups, but takes things a step further. The most recent full backup is treated as the master backup. Earlier data residing on the backup store is deduplicated against the most recent full backup, rather than the most recent backup being deduplicated against preexisting data. As a result, the newest backup remains intact on disk, and the rehydration penalty shifts to restores of older data.
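The direction of the references is the key point, and a small sketch can show it. Here the latest backup is kept whole, and an older backup is rewritten as pointers into it plus whatever blocks only the older backup contains. This is an illustrative model only. The function names, the recipe format and the four-byte block size are assumptions and do not reflect how any specific product implements forward referencing.

```python
BLOCK_SIZE = 4  # unrealistically small so the example is easy to follow


def split(data):
    """Cut data into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def forward_reference(latest, older):
    """Keep the latest backup whole and express the older one relative to it.

    Returns (master, recipe, extras). Each recipe entry is ("master", i) for a
    block found in the latest backup, or ("extra", j) for a block that exists
    only in the older backup.
    """
    master = split(latest)
    where = {}
    for i, block in enumerate(master):
        where.setdefault(block, i)  # remember the first position of each block

    recipe, extras = [], []
    for block in split(older):
        if block in where:
            recipe.append(("master", where[block]))
        else:
            recipe.append(("extra", len(extras)))
            extras.append(block)
    return master, recipe, extras


def rebuild(master, recipe, extras):
    """Rehydrate the older backup from its recipe."""
    pools = {"master": master, "extra": extras}
    return b"".join(pools[kind][i] for kind, i in recipe)


latest = b"AAAABBBBDDDD"
older = b"AAAABBBBCCCC"
master, recipe, extras = forward_reference(latest, older)
print(recipe, extras)
```

Restoring the latest backup in this model is a straight sequential read of the master blocks, while the older backup is the one that has to be pieced together from references.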
Data protection process compatibility
One last factor that must be taken into account with regard to data deduplication is any other action that you have taken to protect your data or to reduce disk space consumption. Probably the best example of such a consideration is data encryption. If you encrypt data prior to deduplicating it, then there is a very good chance that the data cannot be deduplicated effectively, because encryption tends to make otherwise identical data look unique.
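The effect of encrypting first is easy to demonstrate. In the sketch below, five identical blocks deduplicate down to one, but once each block is encrypted with its own random nonce, no two of them match. The toy_encrypt function is a deliberately simple stand-in written for this example. It is not secure and is not the API of any real encryption library. It only mimics the property that matters here, which is that many encryption schemes use a random nonce or initialization vector and therefore produce different output each time, even for identical input.

```python
import hashlib
import os


def toy_encrypt(key, plaintext):
    """Toy stream cipher for illustration only. NOT secure; do not reuse."""
    nonce = os.urandom(16)  # fresh random value for every block
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    return nonce + body


def unique_blocks(blocks):
    """Count how many distinct blocks a hash-based deduplicator would store."""
    return len({hashlib.sha256(block).digest() for block in blocks})


key = b"example-key"  # hypothetical key
plain_blocks = [b"same payload " * 10] * 5
encrypted_blocks = [toy_encrypt(key, block) for block in plain_blocks]

plain_unique = unique_blocks(plain_blocks)
encrypted_unique = unique_blocks(encrypted_blocks)
print(plain_unique, encrypted_unique)
```

Deduplicating before encrypting avoids this problem in the sketch, since the duplicate blocks are identified while they are still identical.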
Some forms of replication can also be problematic when used in conjunction with deduplication. Some data protection products are designed to look for changes to storage blocks and then replicate modified blocks to a backup server. If such a product is not designed to be used in conjunction with deduplication, then the deduplication process can cause a file to be replicated. In these cases, the replication software treats the deduplication process as a modification, even though the deduplicated data has not actually been modified. The end result can be a lot of unnecessary replication.
Thankfully, data protection products and technologies are becoming increasingly deduplication-aware. For example, the DFS-R support that is built into Windows Server 2012 is aware of Windows' native deduplication capabilities, and the deduplication process will not trigger an unnecessary replication.
As you can see, there are a number of considerations that must be taken into account when planning for deduplication. The best thing that you can do is to test a variety of deduplication products from several different vendors before you commit to using a specific deduplication method. Remember that a deduplication product's efficiency depends largely on the data that is being deduplicated. This means that a product that has proven to be the most effective in one organization might not necessarily be the best choice for another organization.
About the Author:
Brien M. Posey, MCSE, has received Microsoft's MVP award for Exchange Server, Windows Server and Internet Information Server (IIS). Brien has served as CIO for a nationwide chain of hospitals and has been responsible for the department of information management at Fort Knox. You can visit Brien's personal website at www.brienposey.com.