
BACKUP AND RECOVERY

Data deduplication methods: Block-level versus byte-level dedupe


Lauren Whitehouse
11.24.2008




Data deduplication identifies duplicate data, removing redundancies and reducing the overall volume of data transferred and stored. In my last article, I reviewed the differences between file-level and block-level data deduplication. In this article, I'll assess byte-level versus block-level deduplication. Byte-level deduplication inspects data at a finer granularity than block-level approaches, which can improve accuracy, but it often requires more knowledge of the backup stream to do its job.

Block-level approaches

Block-level data deduplication segments data streams into blocks, inspecting the blocks to determine if each has been encountered before (typically by generating a digital signature or unique identifier via a hash algorithm for each block). If the block is unique, it is written to disk and its unique identifier is stored in an index; otherwise, only a pointer to the original, unique block is stored. By replacing repeated blocks with much smaller pointers rather than storing the block again, disk storage space is saved.
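
The flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation: the fixed 4 KB block size, the choice of SHA-256 as the hash, and the names BlockStore, write, read and stored_bytes are all assumptions made for the example, and the "disk" is just an in-memory dictionary.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size; real products vary


class BlockStore:
    """Toy block-level deduplicating store (illustrative only)."""

    def __init__(self):
        # Index: block identifier (hash digest) -> the unique block's bytes.
        self.index = {}

    def write(self, data: bytes) -> list:
        """Store a stream and return its list of pointers (digests)."""
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.index:
                # Block not seen before: store it once.
                self.index[digest] = block
            # Seen or not, the stream only records a pointer to the block.
            pointers.append(digest)
        return pointers

    def read(self, pointers: list) -> bytes:
        """Rebuild the original stream by following the pointers."""
        return b"".join(self.index[d] for d in pointers)

    def stored_bytes(self) -> int:
        """Bytes actually held for unique blocks."""
        return sum(len(block) for block in self.index.values())


store = BlockStore()
stream = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # four blocks, two unique
pointers = store.write(stream)
print(len(stream), store.stored_bytes())
```

In this toy run, the 16 KB stream contains only two distinct blocks, so the store holds 8 KB of block data plus four short pointers.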

The criticisms of block-based approaches are that 1) using a hash algorithm to calculate the unique ID brings the risk of a false positive, in which two different blocks produce the same ID; and 2) storing unique IDs in an index can slow the inspection process as the index grows larger and requires disk I/O (unless the index size is kept in check and data comparison occurs in memory).

A hash collision spells a false positive when a hash-based algorithm is used to determine duplicates. Hash algorithms, such as MD5 and SHA-1, generate a fixed-length number that is intended to be unique for the chunk of data being examined. A collision occurs when two different chunks yield the same number, so the second chunk is wrongly treated as a duplicate and discarded. While hash collisions and the resulting data corruption are possible, the chances that one will occur are slim.
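
To put a rough number on "slim," the standard birthday-bound approximation can be worked through in Python. This is a back-of-the-envelope sketch, not a guarantee: it assumes the hash behaves like an ideal, uniformly distributed function and it ignores deliberately crafted collisions. The function name collision_probability and the block count used below are hypothetical.

```python
import math


def collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_blocks distinct blocks
    share a digest, using the birthday bound:
        p ~= 1 - exp(-n * (n - 1) / 2 ** (bits + 1))
    """
    exponent = -num_blocks * (num_blocks - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Hypothetical store holding 2**40 unique blocks.
for name, bits in (("MD5", 128), ("SHA-1", 160)):
    print(name, collision_probability(2 ** 40, bits))
```

Under these assumptions, the estimate shrinks as the digest gets longer and grows as more unique blocks are stored, which is why the size of the store matters when weighing this risk.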

Byte-level data deduplication

Analyzing data streams at the byte level is another approach to deduplication. By performing a byte-by-byte comparison of new data streams versus previously stored ones, a higher level of accuracy can be delivered. Deduplication products that use this method share one assumption: the incoming backup data stream has likely been seen before, so it is reviewed to see if it matches similar data received in the past.
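
As a rough illustration of byte-by-byte comparison against previously stored data, the sketch below uses Python's general-purpose difflib.SequenceMatcher to describe a new stream as copies from an older one plus the bytes that actually changed. This is a stand-in chosen for brevity, not the algorithm any particular product uses, and it would be far too slow for real backup volumes. The function names byte_delta and apply_delta and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def byte_delta(old: bytes, new: bytes) -> list:
    """Describe new as copies from old plus literal changed bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # pointer into stored data
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))    # only the changed bytes
        # "delete": bytes present only in old, so nothing needs storing
    return ops


def apply_delta(old: bytes, ops: list) -> bytes:
    """Rebuild the new stream from the old one and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


previous = b"The quick brown fox jumps over the lazy dog"
incoming = b"The quick brown cat jumps over the lazy dog"
delta = byte_delta(previous, incoming)
assert apply_delta(previous, delta) == incoming
print(delta)
```

Because every byte is compared directly, there is no hash to collide; the trade-off is the extra comparison work, which is one reason knowledge of the backup stream helps narrow down what to compare against.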

Products leveraging a byte-level approach are typically "content aware," which means the vendor has done some reverse engineering of the backup application's data stream to understand how to retrieve information such as the file name, file type, date/time stamp, etc. This method reduces the amount of computation required to determine unique versus duplicate data. The caveat? This approach typically occurs post-process -- performed on backup data once the backup has completed. Backup jobs, therefore, complete at full disk performance, but require a reserve of disk cache to perform the deduplication process. It's also likely that the deduplication process is limited to a backup stream from a single backup set and not applied "globally" across backup sets.

Once the deduplication process is complete, the solution reclaims disk space by deleting the duplicate data. Before space reclamation is performed, an integrity check can be performed to ensure that the deduplicated data matches the original data objects. The last full backup can also be maintained so recovery is not dependent on reconstituting deduplicated data, enabling rapid recovery.
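
The ordering described above, verify first and reclaim second, can be sketched as follows. This is a simplified illustration under stated assumptions: the staged backups live in an in-memory dictionary standing in for the disk cache, and the names verify_then_reclaim, staging and reconstruct are hypothetical rather than taken from any product.

```python
from typing import Callable, Dict


def verify_then_reclaim(staging: Dict[str, bytes], name: str,
                        reconstruct: Callable[[], bytes]) -> bool:
    """Delete the staged original only if the deduplicated copy matches it."""
    original = staging[name]
    rebuilt = reconstruct()       # rebuild the object from deduplicated data
    if rebuilt != original:       # direct byte-for-byte comparison
        return False              # mismatch: keep the original untouched
    del staging[name]             # match: the space can be reclaimed
    return True


staging = {"full_backup": b"payroll records"}
ok = verify_then_reclaim(staging, "full_backup", lambda: b"payroll records")
print(ok, "full_backup" in staging)
```

If the reconstruction does not match, nothing is deleted, so a faulty deduplication pass costs disk space rather than data.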

Which approach is best?

Both block- and byte-level methods deliver the benefit of optimizing storage capacity. When, where, and how the processes work should be reviewed for your backup environment and its specific requirements before selecting one approach over another. Your vetting process should also include references from organizations with similar characteristics and requirements.

About this author: Lauren Whitehouse is an analyst with Enterprise Strategy Group and covers data protection technologies. Lauren is a 20-plus-year veteran in the software industry, formerly serving in marketing and software development roles.









All Rights Reserved, Copyright 2008 - 2009, TechTarget | Read our Privacy Policy