There are three things to verify when considering a target-based data deduplication product: cost, capacity and...
throughput. When considering the cost of deduplication systems (or any system for that matter), remember to include both capital expenditures (CAPEX) and operational expenditures (OPEX). Look at what hardware and software you'll need to acquire to use a particular appliance to match a given throughput and capacity model.
Some dedupe vendors make it very easy to arrive at a CAPEX number: for example, you need to store 30 TB of data, and you back up 5 TB/day, so you need model x. It includes all the computing and storage capacity you need to meet your requirements. Other vendors just provide a gateway that you can connect to your own storage. Finally, some vendors provide just the software, leaving the purchase of all hardware up to you. Remember to include the cost of the server hardware in this configuration, making sure that you're specifying a server configuration that's approved by that vendor. In both the gateway- and software-only pricing models, make sure to include the cost of the disk in your comparison even if it's "free." The dedupe pricing world is so unique that there are scenarios where you can actually save money by not using disk you already have.
One final cost element: Remember to add in (if necessary) any "extra" disks, such as a "landing zone" (found in post-process systems), a "cache" where data is kept in its original format for faster restores or any disks not used to store deduplicated data. All of those disks should be considered in the total cost of purchasing the system.
You then need to consider OPEX. As you're evaluating each vendor, make note of how you'll need to maintain their systems and how the systems will work with your backup software vendor. Is there a custom interface between the two (e.g., Veritas NetBackup's OST API), or will your system just pretend to be a tape library or a file system? How will that affect your OPEX? What's it like to replace disk drives, disk arrays or systems that are part of this system?
There are two ways to test capacity. The first is to send a significant amount of backups to the device and compare the size of those backups with the amount of storage they take up on the target system. This will show your dedupe ratio. Multiply that ratio times the disk capacity used to store deduped data and you'll get your effective capacity. The second method is to send backups to the device until it fills up and then record how many backups were sent. The latter method takes longer, but it's the only way to know how the system will perform long term. (The performance of some systems decreases as they near capacity.)Performance considerations for dedupe
Finally, there are several things you should test for performance.
Ingest/Write. The first measure of a disk system (dedupe or not) is its ability to ingest (i.e., write) backups. (While restore performance is technically more important, you can't restore what you didn't back up.) Remember to test both aggregate and single-stream backup performance.
Restore/Copy/Read speed. The second measure of a disk system (dedupe or not) is its ability to restore or copy (i.e., read) backups. I like to point out that the whole reason we started doing disk-to-disk-to-tape (D2D2T) backups was to use disk as a buffer to tape; therefore, if a disk system (dedupe or not) isn't able to stream a modern tape drive when copying backups to tape, then it misses the point. Remember to test the tape copy where you plan to do the tape copy; for example, if you plan to replicate to another system and make the tape there, test that. Finally, don't assume that restore speeds will be fine, and remember to test both single-stream and aggregate restore performance.
This article originally appeared in Storage magazine.