The realistic answer is that it depends on your specific data, your data deduplication deployment environment and the power of the devices you choose. "The in-line approach with a single box only goes so far," said Preston. And without global dedupe, throwing more boxes at the problem won't help. Today, said Preston, "post-processing is ahead, but that will likely change. By the end of the year, Diligent [now an IBM company], Data Domain [Inc.] and others will have global dedupe. Then we'll see a true race."
2. Post-process dedupe happens only after all backups have been completed. Post-process systems typically wait only until a given virtual tape is no longer in use before deduping it; they don't wait for all the tapes in the backup, said Preston. Deduping can start on the first tape as soon as the system starts backing up the second. "By the time it dedupes the first tape, the next tape will be ready for deduping," he said.
3. Vendors' ultra-high deduplication ratio claims. Figuring out your ratio isn't simple, and ratios claimed by vendors are highly manipulated. "The extravagant ratios some vendors claim -- up to 400:1 -- are really getting out of hand," said Lauren Whitehouse, an analyst at the Enterprise Strategy Group. The "best" ratio depends on the nature of the specific data and how frequently it changes over a period of time.
"Suppose you dedupe a data set consisting of 500 files, each 1 GB in size, for the purpose of backup," said Dan Codd, CTO at EMC Corp.'s Software Group. "The next day one file is changed. So you dedupe the data set and back up one file. What's your backup ratio? You could claim a 500:1 ratio."
Grey Healthcare Group, a New York City-based healthcare advertising agency, works with many media files, some exceeding 2 GB in size. The company was storing its files on a 13 TB EqualLogic (now owned by Dell Inc.) iSCSI SAN, and backing it up to a FalconStor Software Inc. VTL and eventually to LTO-2 tape. Using FalconStor's post-processing deduplication, Grey Healthcare was able to reduce 175 TB to 2 TB of virtual disk over a period of four weeks, "which we calculate as better than a 75:1 ratio," said Chris Watkis, IT director.
Watkis realizes that the same deduplication results could be calculated differently depending on the time frame used. "So maybe it was 40:1 or even 20:1. In aggregate, we got 175 TB down to 2 TB of actual disk," he said.
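To see how much the choice of time frame moves the number, here is a minimal Python sketch of the arithmetic behind Codd's and Watkis's examples. The function name `dedupe_ratio` is made up for illustration, and the cumulative case assumes the first backup stored all 500 GB with no internal redundancy, which the article does not state.

```python
def dedupe_ratio(logical_gb: float, stored_gb: float) -> float:
    """Return logical data protected divided by physical data stored."""
    if stored_gb <= 0:
        raise ValueError("stored_gb must be positive")
    return logical_gb / stored_gb


# Codd's example: 500 files of 1 GB each; one file changes on day two.
full_backup_gb = 500 * 1.0
changed_gb = 1.0

# Day-two view: a 500 GB data set was protected by writing only 1 GB.
day_two = dedupe_ratio(full_backup_gb, changed_gb)

# Cumulative view over both days.
# Assumption: the first backup had to store the full 500 GB.
cumulative = dedupe_ratio(2 * full_backup_gb, full_backup_gb + changed_gb)

# Grey Healthcare's aggregate: 175 TB reduced to 2 TB over four weeks.
grey_aggregate = dedupe_ratio(175_000, 2_000)

print(f"day two only:              {day_two:.1f}:1")
print(f"cumulative, two days:      {cumulative:.1f}:1")
print(f"Grey Healthcare aggregate: {grey_aggregate:.1f}:1")
```

The same data set yields 500:1 or roughly 2:1 depending on which window is measured, which is why a vendor's headline ratio says little without the time frame attached.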
4. Proprietary algorithms deliver the best results. Algorithms, whether proprietary or open, fall into two general categories: hash-based, which generates pointers to the original data in the index; and content-aware, which looks to the latest backup.
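The hash-based category can be illustrated with a short Python sketch. It is not any vendor's implementation: the class name `HashDedupeStore`, the fixed 4 KB chunk size and the in-memory dictionary standing in for the index are all assumptions chosen to keep the example small. It uses SHA-1 only because the article names it as one of the public algorithms in use.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, in bytes


class HashDedupeStore:
    """Toy hash-based dedupe: each unique chunk is stored once in an index."""

    def __init__(self):
        self.index = {}  # SHA-1 hex digest -> chunk bytes

    def write(self, data: bytes) -> list:
        """Store data and return the list of pointers (digests) describing it."""
        pointers = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in self.index:
                self.index[digest] = chunk  # first time this chunk is seen
            pointers.append(digest)         # duplicates become pointers only
        return pointers

    def read(self, pointers: list) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.index[p] for p in pointers)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.index.values())


store = HashDedupeStore()
day1 = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
day2 = day1[:-CHUNK_SIZE] + b"C" * CHUNK_SIZE  # only the last chunk changed
p1 = store.write(day1)
p2 = store.write(day2)

logical = len(day1) + len(day2)
print(f"logical bytes: {logical}, stored bytes: {store.stored_bytes()}")
```

Note that this toy trusts the digest alone to decide whether two chunks are identical, which is exactly the design that raises the hash-collision question discussed under myth 5.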
"The science of hash-based and content-aware algorithms is widely known," said Neville Yates, CTO at Diligent. "Either way, you'll get about the same performance."
Yates, of course, claimed Diligent uses yet another approach. Its algorithm, he explained, uses small amounts of data that can be kept in memory, even when dealing with a petabyte of data, thereby speeding performance. Wunder of Magnum Semiconductor, a Diligent customer, deals with files that typically run approximately 22 KB, and he felt Diligent's approach delivered good results. He didn't find it necessary to dig any deeper into the algorithms.
"We talked to engineers from both Data Domain and ExaGrid Systems Inc. about their algorithms, but we really were more interested in how they stored data and how they did restores from old data," said Michael Aubry, director of information systems for three central California hospitals in the 19-hospital Adventist Health Network. The specific algorithms each vendor used never came up.
FalconStor opted for public algorithms, like SHA-1 or MD5. "It's a question of slightly better performance [with proprietary algorithms] or more-than-sufficient performance for the job [with public algorithms]," said John Lallier, FalconStor's VP of technology. Even the best algorithms still remain at the mercy of the transmission links, which can lose bits, he added.
5. Hash collisions increase data bit-error rates as the environment grows. Statistically this appears to be true, but don't lose sleep over it. Concerns about hash collisions apply only to deduplication systems that use a hash to identify redundant data. Vendors that use a secondary check to verify a match, or that don't use hashes at all, don't have to worry about hash collisions.
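A secondary check of the kind described above might look like the following Python sketch. The names `weak_hash` and `VerifyingStore` are hypothetical, and the hash is deliberately truncated to a single byte so that collisions happen constantly; a real system would use a full-length digest and would rarely, if ever, reach the collision branch.

```python
import hashlib


def weak_hash(chunk: bytes) -> bytes:
    """Deliberately weak: keep only 1 byte of SHA-1 so collisions are common."""
    return hashlib.sha1(chunk).digest()[:1]


class VerifyingStore:
    """Toy dedupe store that confirms every hash match byte for byte."""

    def __init__(self, hash_fn=weak_hash):
        self.hash_fn = hash_fn
        self.buckets = {}    # digest -> list of distinct chunks sharing it
        self.collisions = 0  # different data that produced the same digest

    def put(self, chunk: bytes) -> tuple:
        """Store a chunk and return a (digest, slot) pointer to it."""
        digest = self.hash_fn(chunk)
        bucket = self.buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == chunk:  # secondary check: a true duplicate
                return digest, slot
        if bucket:
            self.collisions += 1   # same digest, different bytes: keep both
        bucket.append(chunk)
        return digest, len(bucket) - 1

    def get(self, pointer: tuple) -> bytes:
        digest, slot = pointer
        return self.buckets[digest][slot]


store = VerifyingStore()
chunks = [str(i).encode() for i in range(300)]  # 300 distinct chunks
pointers = [store.put(c) for c in chunks]
again = store.put(chunks[0])                    # a genuine duplicate

print(f"collisions caught: {store.collisions}")
print("all chunks intact:", all(store.get(p) == c for p, c in zip(pointers, chunks)))
```

With only 256 possible digests and 300 distinct chunks, collisions are guaranteed, yet every chunk is restored intact because no chunk is discarded on the strength of the hash alone. The cost is an extra comparison on every match.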
W. Curtis Preston did the math on his Backup Central blog and found that with 95 exabytes of data there's a 0.00000000000001110223024625156540423631668090820313% chance your system will discard a block it should keep as a result of a hash collision. The chance the corrupted block will actually be needed in a restore is even more remote.
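For readers who want to try the arithmetic themselves, the sketch below applies the standard birthday-problem upper bound, p <= n(n-1) / 2^(b+1), for n blocks and a b-bit hash. It does not attempt to reproduce Preston's exact figure, since his block size and method are not given here; the 8 KiB block size, the decimal definition of an exabyte and the function name `collision_probability` are all assumptions.

```python
from fractions import Fraction


def collision_probability(total_bytes: int, block_bytes: int, hash_bits: int) -> Fraction:
    """Birthday-problem upper bound on the chance that any two blocks collide."""
    n = total_bytes // block_bytes              # number of hashed blocks
    return Fraction(n * (n - 1), 2 ** (hash_bits + 1))


EXABYTE = 10 ** 18   # assumption: decimal exabytes
BLOCK = 8 * 1024     # assumption: 8 KiB blocks

p_sha1 = collision_probability(95 * EXABYTE, BLOCK, 160)  # SHA-1: 160-bit digest
p_md5 = collision_probability(95 * EXABYTE, BLOCK, 128)   # MD5: 128-bit digest

print(f"SHA-1, 95 EB: {float(p_sha1):.3e}")
print(f"MD5,   95 EB: {float(p_md5):.3e}")
```

The bound grows with the square of the block count, so it does rise as the environment grows, but with a 160-bit hash it stays vanishingly small even at this scale under these assumptions. It also covers accidental collisions only, not deliberately constructed ones.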
"And if you have something less than 95 exabytes of data, then your odds don't appear in 50 decimal places," says Preston. "I think I'm OK with these odds."
This article originally appeared in Storage magazine.
About this author: Alan Radding is a frequent contributor to "Storage" magazine.