The top five data deduplication misconceptions

By Alan Radding
30 Sep 2008 | SearchDataBackup.com


Because data deduplication products are relatively new, are based on different technologies and algorithms, and are upgraded often, a number of myths and misconceptions have grown up around various forms of the technology. The following are the top five data deduplication myths, according to data backup and data protection experts.

1. In-line deduplication is better than post-processing. "If your backups aren't slowed down, and you don't run out of hours in the day, does it matter which method you chose? I don't think so," said Executive Editor and Independent Backup Expert W. Curtis Preston.

John Wunder, director of IT at Milpitas, CA-based Magnum Semiconductor, said his in-line data dedupe works just fine. "If there's a delay, it's very small; and since we're going directly to disk, any delay doesn't even register."

The realistic answer is that it depends on your specific data, your data deduplication deployment environment and the power of the devices you choose. "The in-line approach with a single box only goes so far," said Preston. And without global dedupe, throwing more boxes at the problem won't help. Today, said Preston, "post-processing is ahead, but that will likely change. By the end of the year, Diligent [now an IBM company], Data Domain [Inc.] and others will have global dedupe. Then we'll see a true race."

2. Post-process dedupe happens only after all backups have been completed. Post-process systems typically wait only until a given virtual tape is no longer in use before deduping it; they don't wait for every tape in the backup to finish, said Preston. Deduping can start on the first tape as soon as the system starts backing up the second. "By the time it dedupes the first tape, the next tape will be ready for deduping," he said.
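The overlap Preston describes can be illustrated with a small timeline calculation. This is only a sketch under stated assumptions: the per-tape durations are made-up numbers, the function name `finish_times` is hypothetical, and a single dedupe engine is assumed to process one tape at a time. It does not model any vendor's scheduler.

```python
from typing import List, Tuple


def finish_times(backup_min: List[int], dedupe_min: List[int]) -> Tuple[int, int]:
    """Return (pipelined_finish, wait_for_all_finish) in minutes.

    pipelined: dedupe of a tape starts once that tape's backup is done
               and the (assumed single) dedupe engine is free.
    wait_for_all: dedupe starts only after every backup has completed.
    """
    backup_done = 0   # time at which the current tape's backup finishes
    dedupe_done = 0   # time at which the dedupe engine becomes free
    for b, d in zip(backup_min, dedupe_min):
        backup_done += b
        dedupe_done = max(dedupe_done, backup_done) + d
    wait_for_all = sum(backup_min) + sum(dedupe_min)
    return dedupe_done, wait_for_all


# Hypothetical figures: three tapes, 60 minutes to back up and 45 to dedupe each.
pipelined, wait_for_all = finish_times([60, 60, 60], [45, 45, 45])
print(f"pipelined: {pipelined} min, wait for all backups: {wait_for_all} min")
```

With these assumed durations the pipelined schedule finishes well before the wait-for-everything schedule, because only the last tape's dedupe falls outside the backup window.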

3. Vendors' ultra-high deduplication ratio claims. Figuring out your ratio isn't simple, and the ratios vendors claim are often heavily manipulated. "The extravagant ratios some vendors claim -- up to 400:1 -- are really getting out of hand," said Lauren Whitehouse, an analyst at the Enterprise Strategy Group. The "best" ratio depends on the nature of the specific data and how frequently it changes over time.

"Suppose you dedupe a data set consisting of 500 files, each 1 GB in size, for the purpose of backup," said Dan Codd, CTO at EMC Corp.'s Software Group. "The next day one file is changed. So you dedupe the data set and back up one file. What's your backup ratio? You could claim a 500:1 ratio."
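Codd's example can be worked through in a few lines. The sketch below is illustrative only: the helper `dedupe_ratio` is a hypothetical name, 1 GB is taken as 10^9 bytes, and the cumulative figure assumes the first day's backup contained no duplicate data at all.

```python
GB = 10**9
FILES = 500


def dedupe_ratio(protected_bytes: int, stored_bytes: int) -> float:
    """Ratio of data protected to data actually written to disk."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return protected_bytes / stored_bytes


# Day 2 in isolation: 500 GB is protected, but only the one changed
# 1 GB file is written.
day2_only = dedupe_ratio(FILES * GB, 1 * GB)

# Day 1 and day 2 together: 1,000 GB protected, 500 GB + 1 GB written
# (assumption: the first backup found nothing to deduplicate).
cumulative = dedupe_ratio(2 * FILES * GB, (FILES + 1) * GB)

print(f"day 2 only: {day2_only:.0f}:1")
print(f"cumulative: {cumulative:.2f}:1")
```

The same two backups yield 500:1 or roughly 2:1 depending on which window is measured, which is why a quoted ratio means little without its time frame.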

Grey Healthcare Group, a New York City-based healthcare advertising agency, works with many media files, some exceeding 2 GB in size. The company was storing its files on a 13 TB EqualLogic (now owned by Dell Inc.) iSCSI SAN, and backing it up to a FalconStor Software Inc. VTL and eventually to LTO-2 tape. Using FalconStor's post-processing deduplication, Grey Healthcare was able to reduce 175 TB to 2 TB of virtual disk over a period of four weeks, "which we calculate as better than a 75:1 ratio," said Chris Watkis, IT director.

Watkis realizes that the same deduplication process results could be calculated differently using various time frames. "So maybe it was 40:1 or even 20:1. In aggregate, we got 175TB down to 2TB of actual disk," he said.
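One way to put these competing ratios in perspective is to convert them to the fraction of disk space saved. The arithmetic below uses only the figures quoted above (175 TB reduced to 2 TB, and the 40:1 and 20:1 alternatives); the function name `space_saved` is an illustrative choice, not anything from FalconStor.

```python
def space_saved(ratio: float) -> float:
    """Fraction of capacity saved for a given N:1 deduplication ratio."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1 - 1 / ratio


aggregate_ratio = 175 / 2  # 175 TB protected, 2 TB stored

for label, ratio in [("aggregate", aggregate_ratio), ("40:1", 40), ("20:1", 20)]:
    print(f"{label:>9}: {ratio:5.1f}:1 saves {space_saved(ratio):.1%} of capacity")
```

The aggregate works out to 87.5:1, consistent with "better than 75:1". Once the ratio is in double digits, large differences in the headline number correspond to only a few percentage points of capacity saved.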

4. Proprietary algorithms deliver the best results. Algorithms, whether proprietary or open, fall into two general categories: hash-based, which computes a hash for each piece of data and keeps pointers to the original data in an index; and content-aware, which compares incoming data against the latest backup.
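The hash-based category can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: it assumes fixed 4 KB chunks, an in-memory dictionary as the index, and SHA-1 as the hash. The names `dedupe`, `restore` and `CHUNK_SIZE` are hypothetical.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products vary


def dedupe(data: bytes, index: Dict[bytes, bytes]) -> List[bytes]:
    """Store each previously unseen chunk once; return pointers (digests)."""
    pointers = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).digest()
        index.setdefault(digest, chunk)  # keep only the first copy
        pointers.append(digest)
    return pointers


def restore(pointers: List[bytes], index: Dict[bytes, bytes]) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(index[d] for d in pointers)


index: Dict[bytes, bytes] = {}
backup = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
pointers = dedupe(backup, index)
print(len(pointers), "chunks referenced,", len(index), "chunks stored")
```

Here four chunks are referenced but only two are stored, and `restore(pointers, index)` returns the original bytes.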

"The science of hash-based and content-aware algorithms is widely known," said Neville Yates, CTO at Diligent. "Either way, you'll get about the same performance."

Yates, of course, claimed Diligent takes a different approach again. Its algorithm, he explained, uses small amounts of data that can be kept in memory, even when dealing with a petabyte of data, thereby speeding performance. Magnum Semiconductor's Wunder, a Diligent customer, deals with files that typically run approximately 22 KB and felt Diligent's approach delivered good results. He didn't find it necessary to dig any deeper into the algorithms.

"We talked to engineers from both Data Domain and ExaGrid Systems Inc. about their algorithms, but we really were more interested in how they stored data and how they did restores from old data," said Michael Aubry, director of information systems for three central California hospitals in the 19-hospital Adventist Health Network. The specific algorithms each vendor used never came up.

FalconStor opted for public algorithms, like SHA-1 or MD5. "It's a question of slightly better performance [with proprietary algorithms] or more-than-sufficient performance for the job [with public algorithms]," said John Lallier, FalconStor's VP of technology. Even the best algorithms still remain at the mercy of the transmission links, which can lose bits, he added.

5. Hash collisions increase data bit-error rates as the environment grows. Statistically this appears to be true, but don't lose sleep over it. Concerns about hash collisions apply only to deduplication systems that use a hash to identify redundant data. Vendors that use a secondary check to verify a match, or that don't use hashes at all, don't have to worry about hash collisions.
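A secondary check of the kind mentioned above might look like the following sketch: a hash match is treated as a candidate, and the chunk is discarded as a duplicate only after a byte-for-byte comparison. Everything here is an assumption made for illustration, including the `VerifiedStore` class and the deliberately weak `weak_hash` function, which exists only to force a collision in the demo.

```python
from typing import Callable, Dict, List, Tuple

Pointer = Tuple[bytes, int]  # (digest, slot within that digest's bucket)


class VerifiedStore:
    """Chunk store that verifies every hash match against the stored bytes."""

    def __init__(self, hash_fn: Callable[[bytes], bytes]):
        self.hash_fn = hash_fn
        self.buckets: Dict[bytes, List[bytes]] = {}

    def put(self, chunk: bytes) -> Pointer:
        digest = self.hash_fn(chunk)
        bucket = self.buckets.setdefault(digest, [])
        for slot, stored in enumerate(bucket):
            if stored == chunk:  # secondary check: compare actual bytes
                return (digest, slot)
        # New data, or a different chunk with a colliding hash: keep it.
        bucket.append(chunk)
        return (digest, len(bucket) - 1)

    def get(self, pointer: Pointer) -> bytes:
        digest, slot = pointer
        return self.buckets[digest][slot]


def weak_hash(chunk: bytes) -> bytes:
    """Toy hash (first byte only) used solely to provoke collisions."""
    return chunk[:1]


store = VerifiedStore(weak_hash)
p1 = store.put(b"alpha")
p2 = store.put(b"avocado")  # collides with b"alpha" under weak_hash
p3 = store.put(b"alpha")    # a true duplicate
print(p1 == p3, p1 != p2, store.get(p2))
```

Even with a hash this poor, the colliding chunk is kept rather than discarded, at the cost of an extra read and compare on every match.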

W. Curtis Preston did the math on his Backup Central blog and found that with 95 exabytes of data there's a 0.00000000000001110223024625156540423631668090820313% chance your system will discard a block it should keep as a result of a hash collision. The chance the corrupted block will actually be needed in a restore is even more remote.

"And if you have something less than 95 exabytes of data, then your odds don't appear in 50 decimal places," says Preston. "I think I'm OK with these odds."
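For readers who want to try the arithmetic themselves, the standard birthday bound gives an upper estimate of the chance that any two distinct chunks share a hash: roughly n(n-1)/2 divided by the number of possible hash values. The sketch below is not a reconstruction of Preston's calculation. The 160-bit hash width (as in SHA-1), the 8 KB chunk size, the decimal exabyte, and the function name `collision_probability` are all assumptions, and the result shifts with each of them.

```python
def collision_probability(total_bytes: int, chunk_bytes: int, hash_bits: int) -> float:
    """Birthday-bound estimate that at least two distinct chunks collide.

    Assumes every chunk is unique and the hash behaves like a uniform
    random function; both are idealizations.
    """
    n = total_bytes // chunk_bytes    # number of chunks hashed
    pairs_times_two = n * (n - 1)     # twice the number of chunk pairs
    return pairs_times_two / (2 * 2**hash_bits)


EXABYTE = 10**18  # decimal exabyte (assumption)

p = collision_probability(95 * EXABYTE, 8 * 1024, 160)
print(f"estimated collision probability: {p:.3e}")
```

Under these assumptions the estimate is vanishingly small, which agrees with the article's conclusion even though the exact digits depend on the inputs chosen.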

This article originally appeared in Storage magazine.

About this author: Alan Radding is a frequent contributor to "Storage" magazine.



All Rights Reserved, Copyright 2008 - 2009, TechTarget | Read our Privacy Policy