An important differentiator among deduplication products is whether they work in-band or out-of-band. That is, do they deduplicate the data as they're writing it to the
The out-of-band method has to write the original data, read it back, identify its redundancies, and then write one or more pointers in place of any redundant data. The advantage of this approach is that you can apply more parallel processes (and processors) to the problem, whereas the in-band method can apply only one process per backup stream. The disadvantage is that the data is written and read more than once, and the multiple reads and writes could cause disk contention. In addition, the out-of-band method requires slightly more disk than an in-band setup, because an out-of-band system must have enough disk to hold the latest set of backups before they're deduplicated. The out-of-band camp counters that slowing down the original backup is unacceptable, and that they'll be able to deduplicate the data in time for tomorrow's backup.
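The write, read, identify, and point sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the fixed 4-byte chunk size, the use of SHA-256 to spot redundant chunks, and every name here (`land_backup`, `post_process`, `restore`, `staging`, `store`, `catalog`) are hypothetical choices made for readability.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and tiny, so the example stays readable


def land_backup(staging, data):
    """Step 1: write the backup to staging disk untouched (no inline work)."""
    staging.append(data)


def post_process(staging, store, catalog):
    """Step 2: read staged backups back, store each unique chunk once,
    and record a list of pointers (chunk digests) for every backup."""
    while staging:
        data = staging.pop(0)  # the extra read that in-band avoids
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in store:
                store[digest] = chunk  # first time this chunk is seen
            pointers.append(digest)  # redundant chunks cost only a pointer
        catalog.append(pointers)


def restore(store, pointers):
    """Rebuild a backup by following its pointers into the chunk store."""
    return b"".join(store[p] for p in pointers)


staging, store, catalog = [], {}, []
land_backup(staging, b"AAAABBBBAAAA")
land_backup(staging, b"AAAACCCC")
post_process(staging, store, catalog)
```

Note that `staging` has to hold both backups in full until `post_process` runs, which mirrors the extra disk the out-of-band method needs for the latest set of backups.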
You probably shouldn't dismiss a vendor simply because it uses one method or the other, but definitely test the different deduplication methods to determine how fast they work in your environment. Remember to test the product against a large number of slower backups as well as a smaller number of backups where speed matters. Some systems perform well for single streams, but don't scale to many streams. Others work well only when you send them many streams, but don't perform well with a very fast single stream. Finally, test the deduplication product with enough data to see whether it will handle the amount of data you back up every day. If it doesn't get the deduplication job done every day in time for the next night's backup, you're going to be in trouble.
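The "in time for the next night's backup" check comes down to simple arithmetic once you have measured a sustained deduplication rate in your own tests. The sketch below is a rough estimate only; the function names and the sample figures (10,000 GB per day, 200 MB/s, a 16-hour window) are hypothetical and should be replaced with your own measurements.

```python
def dedupe_hours(daily_backup_gb, dedupe_rate_mb_per_s):
    """Hours needed to deduplicate one day's backups at a sustained rate."""
    seconds = (daily_backup_gb * 1024) / dedupe_rate_mb_per_s
    return seconds / 3600


def finishes_in_time(daily_backup_gb, dedupe_rate_mb_per_s, window_hours):
    """True if deduplication completes before the next backup begins."""
    return dedupe_hours(daily_backup_gb, dedupe_rate_mb_per_s) <= window_hours


# Hypothetical figures: 10,000 GB per day, 200 MB/s measured, 16-hour window.
hours = dedupe_hours(10_000, 200)
print(f"Estimated deduplication time: {hours:.1f} hours")
print("Fits the window:", finishes_in_time(10_000, 200, 16))
```

A rate measured on a small test set can overstate what the system sustains under a full daily load, so it is worth running the estimate with the slowest rate you observed.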
This was first published in February 2007