Restores from some data dedupe systems can be slower than modern tape drives, and restores from other deduplication systems can be faster than tape could ever dream of. In this column, you will learn what you need to do to make sure your dedupe system is on the right side of those two extremes.
To explain why there is what many call a "dedupe tax," we need to go back in time to when there was no data deduplication technology. Before dedupe, we wrote data to tape or disk in contiguous blocks. (Writing contiguous blocks of backup data to disk requires either an empty filesystem, or a disk system designed for that purpose, such as a virtual tape library.) The blocks that comprised a given backup were all located in proximity to each other.
Data backup systems also occasionally perform a full backup, a synthetic full backup, or otherwise collocate files necessary for a complete restore (e.g., IBM TSM's active data pools). These typical behaviors of backup systems meant that the bulk of the blocks needed to restore a given system would all be placed contiguously on disk or tape, making a complete restore of that system very easy to accomplish.
Most would agree that the best thing for fast restores would be to have a recent full (or an updated active data pool) on disk, and to have any subsequent backups also on disk. This is why it comes as a surprise to many when they read that restores from a deduplication system (which is almost always on disk) could be slower than what they're used to -- even slower than from tape.
The reason for this is that no matter how you back up your data to a deduplication system, it is rarely stored contiguously on disk. Since the blocks that comprise the latest backup were actually created over time, those blocks will be stored all over the dedupe system, based on when they were backed up. A restore from deduped data is therefore a very fragmented read from disk. Instead of a single disk seek followed by a large disk read of hundreds or thousands of blocks, you could have hundreds of disk seeks and reads from hundreds of different disk drives.
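To make the cost of fragmentation concrete, here is a minimal back-of-the-envelope sketch. It assumes a single disk, a fixed cost per seek, and a fixed sequential transfer rate; the function name and the default figures (8 ms per seek, 100 MBps) are illustrative assumptions, not measurements from any product.

```python
def restore_seconds(total_mb, extents, seek_ms=8.0, stream_mbps=100.0):
    """Rough restore time: one seek per contiguous extent, plus transfer time.

    All parameters are illustrative assumptions for a single-disk model.
    """
    seek_time = extents * (seek_ms / 1000.0)
    transfer_time = total_mb / stream_mbps
    return seek_time + transfer_time


# The same 100,000 MB restore, read as one extent or as a million fragments.
contiguous = restore_seconds(total_mb=100_000, extents=1)
fragmented = restore_seconds(total_mb=100_000, extents=1_000_000)

print(f"contiguous: {contiguous:,.0f} s")
print(f"fragmented: {fragmented:,.0f} s")
```

Under these assumptions the contiguous restore takes roughly 1,000 seconds and the fragmented one roughly 9,000 seconds, even though both read the same amount of data. Real systems spread reads across many drives, so treat the numbers only as a way to see which term dominates.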
Analyzing your deduplication system's performance
So how is it that some deduplication systems will have better read performance than other systems? It depends on a few things. The first factor is the manner in which the systems store data each day and the degree to which their data storage method fragments the data. The second factor is how much they do to mitigate the negative restore effects of their data storage methods. Finally, some dedupe systems may have a single-stream limitation that can impact their restore speed.
The first thing that affects how dedupe systems store data is what they do when they find a match to an older segment of data. Do they leave the old segment in place and write a pointer to the older segment instead of writing the new segment to disk (reverse referencing)? Or do they write the new segment to disk, delete the older segment and replace it with a pointer to the new segment (forward referencing)? Deduplication systems that continually write pointers for newer, redundant segments may end up with more fragmentation in newer data. Systems that always write newer data and delete older data may end up with newer backups stored more contiguously. Forward referencing, which is only possible in post-processing deduplication systems, should therefore result in faster restores of more recent data, which is where most restores come from.
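The difference between the two referencing styles can be illustrated with a toy model. The sketch below treats the disk as a simple append-only sequence of numbered positions and counts how many contiguous runs must be read to restore the latest backup. The function names and the three sample backups are hypothetical, and the model ignores everything a real product does beyond the choice of which copy to keep.

```python
def store_backups(backups, forward):
    """Return {segment: disk_position} after ingesting backups in order.

    forward=False (reverse referencing): keep the oldest copy of a segment;
    newer backups simply point back to it.
    forward=True (forward referencing): always write the newest copy; older
    backups are redirected to it and the old copy is freed.
    """
    location = {}
    next_pos = 0
    for backup in backups:
        for seg in backup:
            if forward or seg not in location:
                location[seg] = next_pos  # write segment at the next free position
                next_pos += 1
    return location


def count_runs(backup, location):
    """Count the contiguous runs of positions read when restoring a backup."""
    runs = 0
    prev = None
    for seg in backup:
        pos = location[seg]
        if prev is None or pos != prev + 1:
            runs += 1  # a jump on disk starts a new run (one more seek)
        prev = pos
    return runs


monday = ["a", "b", "c", "d", "e", "f"]
tuesday = ["a", "x", "c", "d", "y", "f"]    # two segments changed
wednesday = ["a", "x", "z", "d", "y", "f"]  # one more segment changed
backups = [monday, tuesday, wednesday]

reverse_runs = count_runs(wednesday, store_backups(backups, forward=False))
forward_runs = count_runs(wednesday, store_backups(backups, forward=True))
print(reverse_runs, forward_runs)  # prints "6 1"
```

In this toy model the latest backup needs six separate runs with reverse referencing and a single run with forward referencing. The price is paid by the older backups: Monday's restore becomes the fragmented one under forward referencing.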
The next thing to consider is whether the system tries to collocate related segments when it can. When a system is accepting a backup stream, it may know that certain data segments are associated with other data segments, and it can do its best to store them in a way that keeps related segments more contiguous.
Another thing that may impact how data is stored is how the dedupe system does what is commonly referred to as garbage collection. As data is expired, some data segments will no longer be needed and can be discarded. Some vendors delete such segments with no regard to how this increases the fragmentation of the remaining data. Other vendors consider such things during their garbage collection process, which is typically run on a daily basis.
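As a rough illustration of why garbage collection affects layout, the sketch below drops every segment that no retained backup still references and reports the disk positions that were freed. It is a simplified model with hypothetical names; it does not describe how any particular vendor implements garbage collection.

```python
def collect_garbage(location, retained_backups):
    """Drop segments that no retained backup references.

    Returns the surviving {segment: position} map and the sorted list of
    freed positions (the holes left behind on disk).
    """
    live = set()
    for backup in retained_backups:
        live.update(backup)
    survivors = {seg: pos for seg, pos in location.items() if seg in live}
    freed = sorted(pos for seg, pos in location.items() if seg not in live)
    return survivors, freed


# Hypothetical layout after two nights of backups.
location = {"a": 0, "b": 1, "c": 2, "d": 3, "x": 4, "y": 5}
tuesday = ["a", "x", "c", "y"]  # still retained; Monday's backup has expired

survivors, freed = collect_garbage(location, [tuesday])
print(survivors)  # prints {'a': 0, 'c': 2, 'x': 4, 'y': 5}
print(freed)      # prints [1, 3]
```

The freed positions sit in the middle of the surviving data. A collector that simply reuses those holes for whatever arrives next will scatter new segments among old ones, which is the kind of side effect that a layout-aware collector tries to avoid.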
Mitigating the data dedupe tax
Vendors do a number of things to mitigate the "dedupe tax." Some vendors do extra work during their garbage collection process to actually relocate related segments together, especially those segments that do not have anything in common with other backups. Think of this as a fancy defragmentation process. Other vendors keep last night's backup stored on disk in its native format so that if someone asks to restore (or copy) from last night's backup, they can satisfy that need without bringing dedupe into the picture. While this is very desirable from a restore speed perspective, it is another technology that requires a post-process architecture, which carries with it the concept of a landing zone where backups are stored in their native format before they are deduped. That landing zone will require extra disk that is not required in an inline configuration.
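The "fancy defragmentation" idea can be sketched in the same toy terms: rewrite the segments of the most recent backup, in restore order, into one contiguous region. This is only an assumption-laden illustration with hypothetical function names. It relocates every segment of the latest backup and appends them past the end of the existing data, whereas a real product would choose which segments to move and where to put them.

```python
def count_runs(backup, location):
    """Count the contiguous runs of positions read when restoring a backup."""
    runs = 0
    prev = None
    for seg in backup:
        pos = location[seg]
        if prev is None or pos != prev + 1:
            runs += 1
        prev = pos
    return runs


def relocate_latest(location, latest):
    """Return a new layout with the latest backup rewritten contiguously.

    Simplification: segments are appended after the highest used position
    rather than packed into previously freed space.
    """
    new_location = dict(location)
    next_pos = max(location.values(), default=-1) + 1
    for seg in latest:
        new_location[seg] = next_pos
        next_pos += 1
    return new_location


location = {"a": 0, "c": 2, "x": 4, "y": 5}  # hypothetical fragmented layout
tuesday = ["a", "x", "c", "y"]

before = count_runs(tuesday, location)
after = count_runs(tuesday, relocate_latest(location, tuesday))
print(before, after)  # prints "4 1"
```

The extra I/O spent during the maintenance window is the cost; the benefit is that the most likely restore becomes a single sequential read in this model.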
Finally, some vendors also suffer from a single-stream limitation. That is, their architecture does not allow a single stream of data out of their product any faster than n MBps. If you plan on copying data from a deduplication system to tape, its single-stream restore speed is paramount. This is because a tape copy is essentially a very demanding restore; not only does the device have to reassemble all the appropriate bits back into their native form, it must do so in a very fast single stream of data. There is no point in trying to copy data to an LTO-5 tape drive that wants 240 MBps (assuming 1.5:1 compression) if the fastest your deduplication system can supply is 90 MBps.
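The arithmetic behind that mismatch is simple enough to sketch. The function below is a hypothetical helper, not part of any backup product; it takes the single-stream rate the dedupe system can deliver and the rate the tape drive wants, both in MBps, and reports how far short the copy falls. The 90 and 240 figures are the ones used in the example above.

```python
def tape_copy_report(dedupe_stream_mbps, tape_target_mbps):
    """Compare what the dedupe system can supply with what the tape drive wants.

    Returns (effective rate, shortfall, fraction of the drive's target met).
    """
    effective = min(dedupe_stream_mbps, tape_target_mbps)
    shortfall = tape_target_mbps - effective
    fraction_met = effective / tape_target_mbps
    return effective, shortfall, fraction_met


effective, shortfall, fraction_met = tape_copy_report(90, 240)
print(effective, shortfall, fraction_met)  # prints "90 150 0.375"
```

In this case the drive would be fed at under 40 percent of the rate it wants, so the slower device sets the pace for the whole copy.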
Suffice it to say that not all dedupe systems are created equal when it comes to restore speed. Make sure you do your homework.
Editor's tip: For more on deduplication systems, read Curtis' Deduplication and backup tutorial.
About this author: W. Curtis Preston (a.k.a. "Mr. Backup"), Executive Editor and Independent Backup Expert, has been singularly focused on data backup and recovery for more than 15 years. From starting as a backup admin at a $35 billion credit card company to being one of the most sought-after consultants, writers and speakers in this space, it's hard to find someone more focused on recovering lost data. He is the webmaster of BackupCentral.com, the author of hundreds of articles, and the author of the books "Backup and Recovery" and "Using SANs and NAS."
This was first published in April 2010