Data deduplication

Best practices for data deduplication include selecting the right deduping product for your environment and then deciding whether to deduplicate your data at the source or in a storage appliance that your backup software views as a tape library or NAS target.

Over the past two years, data deduplication has evolved from Data Domain's revolutionary secret technology into one edging its way into the backup mainstream. There are now about a dozen data deduplication products on the market. Choosing the right one for your environment, and getting the most out of it, requires some smart decision-making on a storage administrator's part. Here are a few dos and don'ts to help you join the data deduplication revolution.

1. Pick the right place to deduplicate

Today you can either deduplicate your data at the source or in a storage appliance that your backup software views as a tape library or NAS target.
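Wherever it happens, the core mechanic is the same: carve the data into blocks, fingerprint each block with a hash, and store each unique block only once. Here's a minimal fixed-block sketch in Python (the 4 KB block size and SHA-1 fingerprint are illustrative choices, not any particular product's design):

```python
import hashlib

def dedupe(stream: bytes, block_size: int = 4096):
    """Fixed-block dedupe sketch: store each unique block once and
    keep an ordered list of fingerprints to rebuild the stream."""
    store = {}    # fingerprint -> block payload (unique blocks only)
    recipe = []   # fingerprints in stream order
    for i in range(0, len(stream), block_size):
        block = stream[i:i + block_size]
        digest = hashlib.sha1(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy is kept
        recipe.append(digest)
    return store, recipe

def restore(store: dict, recipe: list) -> bytes:
    return b"".join(store[d] for d in recipe)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 8192  # lots of repeated content
store, recipe = dedupe(data)
print(len(recipe), len(store))  # 5 logical blocks, but only 2 stored
assert restore(store, recipe) == data
```

Shipping products layer variable-size chunking, compression and collision handling on top of this basic idea, but the space savings all come from that `setdefault` step.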

In addition to reducing the amount of disk space needed to store your backups, deduping at the source also reduces the network bandwidth needed to back up any given set of data. However, there's no such thing as a free lunch: deduping at the source with solutions such as EMC's Avamar and Symantec's PureDisk requires loading a fairly heavy agent on each server to be backed up, and that agent is going to use a bunch of CPU cycles at backup time.

Deduping at the backup target shifts all that compute to the appliance, allowing you to continue to use your current backup software and backup procedures.

Today source deduplication is best suited to backing up remote office servers and other remote servers where network bandwidth is scarce and conventional backup applications don't work that well. Within the data center, deduping at the backup appliance lets you take advantage of deduping technology with maximum performance...and minimum pain.

2. Don't worry about hash collisions

Some system administrators have avoided data deduplication because they fear the dreaded hash collision. Most deduping solutions use a hash function such as MD5 or SHA-1 to describe data blocks and to identify duplicates. Since these hash functions produce digests of just 16 to 64 bytes to describe data blocks that can be 128 KB or larger, there is a mathematical possibility that two different blocks of data can generate the same hash value.

The truth is that the probability of a hash collision is very, very small. For n data blocks hashed into d possible values, the probability that no two blocks share a value is approximately e^(-n(n-1)/2d), so the probability of at least one collision is roughly n(n-1)/2d. For 10 petabytes of deduped data in 4 KB blocks using SHA-1 (a 160-bit hash), the odds of any two blocks sharing a value work out to less than 1 in 10^23. To improve reliability even further, some vendors use multiple hash functions or perform bit-by-bit compares of blocks once the hash function declares them to be duplicates.
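The birthday-bound arithmetic is easy to check yourself, using the figures assumed above (10 PB of data, 4 KB blocks, a 160-bit SHA-1 digest):

```python
# Birthday bound for hash collisions across an entire deduped store.
n = (10 * 2**50) // (4 * 2**10)  # number of 4 KB blocks in 10 PB
d = 2**160                       # number of possible SHA-1 values
# P(at least one collision) ~= n(n-1)/2d when n is far below sqrt(d)
p_collision = n * (n - 1) / (2 * d)
print(p_collision)  # on the order of 1e-24
```

About 2.7 trillion blocks against 2^160 possible digests leaves the collision probability around 10^-24, which is why vendors treat it as a non-issue.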

The naysayers also ignore the fact that all data in any networked system is subject to error while in transit or on disk. TCP/IP and Fibre Channel both use a CRC-32 checksum to validate that data arrives at its destination unaltered. The probability that CRC-32 missed a flipped bit in your data as it crossed the SAN or LAN is about 2x10^-9, which is more than 10^20 times more likely than an MD5 collision. Disk vendors quote bit error rates on the order of 1x10^-16. So as long as the dedupe process is 1,000 times more reliable than the disk drives that store the data, a hash collision is the last thing I'm worrying about.
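You can sanity-check those orders of magnitude in a couple of lines. This uses 2^-32 as a rough model for an undetected CRC-32 error and 2^-128 for two different blocks sharing an MD5 digest; both are simplifying assumptions, not exact figures:

```python
# Rough comparison of the two failure modes discussed in the text.
crc32_miss = 2.0 ** -32          # corrupted frame slips past CRC-32
md5_pair_collision = 2.0 ** -128  # two distinct blocks share an MD5 digest
ratio = crc32_miss / md5_pair_collision  # = 2**96, about 8e28
print(ratio > 1e20)  # an undetected CRC error is >10^20 times more likely
```

Whatever exact figures you plug in, the gap is tens of orders of magnitude: everyday transport errors dwarf the hash-collision risk.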

3. Leave enough temp space for post processing

Some deduping appliances, including those from Sepaton, ExaGrid and the various FalconStor OEMs, deduplicate data after storing it to disk. These vendors contend that post-processing allows them to boost backup performance. However, when comparing these appliances with those from vendors like Data Domain that dedupe data as it's received, remember that you'll need enough disk space to hold your data before it's deduped.

If your backup scheme makes full backups of all your servers over the weekend, you may need enough extra space to store those full backups until they're deduped.
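A quick sizing sketch makes the point; the environment figures below are purely hypothetical:

```python
# Hypothetical environment: 200 servers, 0.5 TB average full backup each,
# a weekend full-backup window, and an assumed 20:1 dedupe ratio.
servers = 200
full_backup_tb = 0.5
dedupe_ratio = 20

landed_tb = servers * full_backup_tb  # data written before dedupe runs
stored_tb = landed_tb / dedupe_ratio  # what remains after post-processing
print(landed_tb, stored_tb)  # 100.0 TB of staging for 5.0 TB ultimately kept
```

With numbers like these, the appliance needs roughly 20 times more landing space than it will ultimately keep, so size the staging area for the raw weekend fulls, not the deduped result.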

4. Verify compatibility between your backup software and deduping appliance

Some deduping backup appliances, including those from ExaGrid and Sepaton, glean knowledge about backup data by recognizing the backup file and/or tape formats created by specific backup software. If you back up data using software that writes a format the appliance doesn't understand, it won't identify duplicates nearly as well.

Even if the appliance you're considering doesn't use backup format data to identify duplicates, it's a good idea to ask the vendor if they've tested their appliance with your backup application.

About the author: Howard Marks is chief scientist of Networks Are Our Lives Inc., a Hoboken, N.J., network and storage consulting and education firm. Marks' company specializes in bringing the infrastructures and processes of mid-market firms up to enterprise standards in the areas of systems, network and storage management, with a focus on data protection and business continuity planning. Marks is the author of three books and more than 200 articles on network and storage topics since 1987. He is a frequent speaker at industry conferences.
