Today you can deduplicate your data either at the source or in a storage appliance that your backup software views as a tape library or NAS target.
In addition to reducing the amount of disk space needed to store your backups, deduping at the source also reduces the network bandwidth needed to back up any given set of data. However, there's no such thing as a free lunch: deduping at the source with solutions such as EMC's Avamar and Symantec's PureDisk requires loading a pretty heavy agent on each server to be backed up, and that agent is going to use a bunch of CPU cycles at backup time.
Deduping at the backup target shifts all that compute to the appliance, allowing you to continue to use your current backup software and backup procedures.
Today source deduplication is best suited to backing up remote office servers and other remote servers where network bandwidth is scarce and conventional backup applications don't
2. Don't worry about hash collisions
Some system administrators have avoided data deduplication because they fear the dreaded hash collision. Most deduping solutions use a hash function such as MD5 or SHA-1 to describe data blocks and to identify duplicates. Since these hash functions create hashes of just 16 to 64 bytes to describe data blocks that can be 128 KB or larger, there is a mathematical possibility that two different sets of data can generate the same hash value.
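To make the mechanism concrete, here is a minimal sketch of hash-based block deduplication. It is an illustration only, not any vendor's implementation: the `DedupStore` class, the fixed 4 KB `BLOCK_SIZE` and the in-memory dictionary are all assumptions made for the example. The optional byte comparison shows one way a store could catch a collision instead of silently returning the wrong block.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size in bytes


class DedupStore:
    """Toy in-memory block store keyed by SHA-1 digest (illustration only)."""

    def __init__(self, verify_bytes=True):
        self.blocks = {}                  # digest -> block bytes
        self.verify_bytes = verify_bytes  # compare bytes when hashes match

    def write(self, data):
        """Store data block by block; return the list of digests (the recipe)."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha1(block).digest()  # 20-byte hash of the block
            existing = self.blocks.get(digest)
            if existing is None:
                self.blocks[digest] = block        # first time this block is seen
            elif self.verify_bytes and existing != block:
                # Same hash, different bytes: a genuine collision.
                raise ValueError("hash collision detected")
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its list of digests."""
        return b"".join(self.blocks[d] for d in recipe)


store = DedupStore()
backup = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE  # four blocks, two unique
recipe = store.write(backup)
print(len(recipe), "blocks written,", len(store.blocks), "blocks stored")
```

The recipe still lists four digests, but only the two unique blocks occupy space in the store.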
The truth is that the probability of a hash collision is very, very small. For a set of n data blocks with hash values from 1 to d, the probability of any two blocks having the same value is approximately $1 - e^{-n(n-1)/(2d)}$. So for 10 petabytes of deduped data in 4 KB blocks using SHA-1 (a 160-bit hash), the odds of any two blocks having the same value are roughly 1 in $4 \times 10^{23}$. To improve reliability even further, some vendors use multiple hash functions or perform bit-by-bit compares of blocks once the hash function declares them to be duplicates.
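The estimate can be checked with a few lines of Python. This is a sketch of the birthday-bound approximation only. The function name `collision_probability` is made up for the example, and the use of binary units (10 x 2^50 bytes, 4 x 2^10 bytes) is an assumption; decimal units change the result only slightly.

```python
import math


def collision_probability(num_blocks, hash_bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1)/(2d)), with d = 2**hash_bits."""
    d = 2 ** hash_bits
    exponent = num_blocks * (num_blocks - 1) / (2 * d)
    # expm1(x) computes exp(x) - 1 accurately even when x is extremely small
    return -math.expm1(-exponent)


data_bytes = 10 * 2**50      # 10 PB of deduped data (binary units assumed)
block_bytes = 4 * 2**10      # 4 KB blocks
n = data_bytes // block_bytes

p = collision_probability(n, 160)   # SHA-1 produces a 160-bit hash
print(f"blocks: {n:,}")
print(f"collision probability: {p:.2e} (about 1 in {1 / p:.1e})")
```

Swapping 160 for 128 gives the corresponding estimate for a 128-bit hash such as MD5.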
The naysayers also ignore the fact that all data in any networked system is subject to error while in transit or on disk. Ethernet and Fibre Channel both use a CRC-32 checksum to validate that data arrives at its destination unaltered. The probability that CRC-32 missed a flipped bit in your data as it crossed the SAN or LAN is about $2 \times 10^{-9}$, which is many orders of magnitude more likely than a hash collision. Disk vendors quote bit error rates on the order of $1 \times 10^{-16}$. So as long as the dedupe process is 1,000 times more reliable than the disk drives that store the data, a hash collision is the last thing I'm worrying about.
3. Leave enough temp space for post processing
Some deduping appliances, including those from Sepaton, Exagrid and the various FalconStor OEMs, deduplicate data after storing it to disk. These vendors contend that post-processing allows them to boost backup performance. However, when comparing these appliances with those from vendors like Data Domain that dedupe as data is received, remember that you'll need to have enough disk space to hold your data before deduping it.
If your backup scheme makes full backups of all your servers over the weekend, you may need enough extra space to store those full backups until they're deduped.
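A back-of-the-envelope comparison shows how much that landing space can matter. The sketch below is a deliberately simple model, not a vendor sizing tool: the function names, the 10 TB weekly full, the 12 retained fulls and the 20:1 dedupe ratio are all hypothetical inputs, and it assumes only one full backup at a time waits to be deduped.

```python
def inline_capacity_tb(weekly_full_tb, retained_fulls, dedupe_ratio):
    """Disk needed when data is deduped as it arrives (deduped store only)."""
    return weekly_full_tb * retained_fulls / dedupe_ratio


def post_process_capacity_tb(weekly_full_tb, retained_fulls, dedupe_ratio):
    """Disk needed when one raw full backup must land before being deduped."""
    landing = weekly_full_tb  # raw space held until post-processing finishes
    return landing + inline_capacity_tb(weekly_full_tb, retained_fulls, dedupe_ratio)


# Hypothetical inputs: 10 TB weekly fulls, 12 fulls retained, 20:1 dedupe ratio
inline = inline_capacity_tb(10, 12, 20)
post = post_process_capacity_tb(10, 12, 20)
print(f"inline: {inline} TB, post-process: {post} TB, extra: {post - inline} TB")
```

With these made-up numbers the landing area is larger than the entire deduped store, which is why it belongs in any capacity comparison between the two approaches.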
4. Verify compatibility between your backup software and deduping appliance
Some deduping backup appliances, including those from Exagrid and Sepaton, glean knowledge about backup data by knowing the backup file and/or tape formats created by specific backup software. If you back up data using software that writes in a format the appliance doesn't understand, it won't identify duplicates nearly as well.
Even if the appliance you're considering doesn't use backup format data to identify duplicates, it's a good idea to ask the vendor if they've tested their appliance with your backup application.
About the author: Howard Marks is chief scientist of Networks Are Our Lives Inc., a Hoboken, N.J., network and storage consulting and education firm. Marks' company specializes in bringing the infrastructures and processes of mid-market firms up to enterprise standards in the areas of systems, network and storage management, with a focus on data protection and business continuity planning. Marks is the author of three books and more than 200 articles on network and storage topics since 1987. He is a frequent speaker at industry conferences.
This was first published in December 2007