Home > Data Backup Tips > Backup and recovery > The pros and cons of file-level vs. block-level data deduplication technology
Data Backup Tips:
EMAIL THIS
 TIPS & NEWSLETTERS TOPICS 

BACKUP AND RECOVERY

The pros and cons of file-level vs. block-level data deduplication technology


Lauren Whitehouse
09.08.2008
Rating: -4.00- (out of 5)


Data backup technical tips
Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google


Data deduplication has dramatically improved the value proposition of disk-based data protection as well as WAN-based remote- and branch-office backup consolidation and disaster recovery (DR) strategies. It identifies duplicate data, removing redundancies and reducing the overall capacity of data transferred and stored.

Some deduplication approaches operate at the file level, while others go deeper to examine data at a sub-file, or block, level. Determining uniqueness at either the file or block level will offer benefits, though results will vary. The differences lie in the amount of reduction each produces and the time each approach takes to determine what's unique.

File-level deduplication

Also commonly referred to as single-instance storage (SIS), file-level data deduplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. The result is that only one instance of the file is saved and subsequent copies are replaced with a "stub" that points to the original file.

Block-level deduplication

Block-level data deduplication operates on the sub-file level. As its name implies, the file is typically broken down into segments -- chunks or blocks -- that are examined for redundancy vs. previously stored information.

The most popular approach for determining duplicates is to assign an identifier to a chunk of data, using a hash algorithm, for example, that generates a unique ID or "fingerprint" for that block. The unique ID is then compared with a central index. If the ID exists, then the data segment has been processed and stored before. Therefore, only a pointe...


Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google



RELATED CONTENT
Data reduction and deduplication
Texas Tech turns to data deduplication for data backup, disaster recovery
EMC gives Avamar desktop and laptop support
Data backup and recovery news briefs: Dynamic Solutions introduces data deduplication products
Data archiving reduces data backup workload prior to data deduplication
Arkeia takes aim at EMC Avamar with Kadena Systems data deduplication IP buy
Data backup and recovery news briefs: Druvaa Software updates flagship product, releases inSync v3.1
Data backup and recovery vendors dig into deduplication technology, aim for cloud backup
Data backup and recovery news briefs: Data Domain upgrades data deduplication appliances
Using data deduplication with backup applications: Source vs. target dedupe
Quantum launches midrange data deduplication backup appliances

Data storage backup tools
HP expands laptop and desktop data backup with Data Protector Notebook Extension
Data backup and recovery news briefs: Rackspace unveils cloud-based file storage apps
EMC gives Avamar desktop and laptop support
Terremark acquires managed data backup and recovery provider DS3 DataVaulting
Data backup and recovery news briefs: Dynamic Solutions introduces data deduplication products
Creating a System Recovery Disk in Windows 7: A step-by-step tutorial
Modern data backup and recovery system considerations
Data backup and recovery news briefs: Thales Group releases CryptoStor Tape 3.0 appliance
Data archiving reduces data backup workload prior to data deduplication
Symantec releases Linux version of Backup Exec System Recovery

Backup and recovery
Criteria for choosing the right tape encryption solution for your data backup plan
Creating a System Recovery Disk in Windows 7: A step-by-step tutorial
Modern data backup and recovery system considerations
SQL Server data backup and recovery best practices
Secure your data backups with encryption key management best practices
Using data deduplication with backup applications: Source vs. target dedupe
Data backup for virtual machines: Alternative methods to VMware Consolidated Backup
Upgrading from LTO-3 to LTO-4 tape for data backup and recovery
Is VMware Consolidated Backup right for your enterprise?
Is cloud data backup service right for your organization?

RELATED RESOURCES
2020software.com, trial software downloads for accounting software, ERP software, CRM software and business software systems
Search Bitpipe.com for the latest white papers and business webcasts
Whatis.com, the online computer dictionary


r to the previously stored data needs to be saved. If the ID is new, then the block is unique. The unique ID is added to the index and the unique chunk is stored.

The size of the chunk to be examined varies from vendor to vendor. Some have fixed block sizes, while others use variable block sizes (and to make it even more confusing, a few allow end users to vary the size of the fixed block). Fixed blocks could be 8 KB or maybe 64 KB -- the difference is that the smaller the chunk, the more likely the opportunity to identify it as redundant. This, in turn, means even greater reductions as even less data is stored. The only issue with fixed blocks is that if a file is modified and the deduplication product uses the same fixed blocks from the last inspection, it might not detect redundant segments because as the blocks in the file are changed or moved, they shift downstream from the change, offsetting the rest of the comparisons.

Variable-sized blocks help increase the odds that a common segment will be detected even after a file is modified. This approach finds natural patterns or break points that might occur in a file and then segments the data accordingly. Even if blocks shift when a file is changed, this approach is more likely to find repeated segments. The tradeoff? A variable-length approach may require a vendor to track and compare more than just one unique ID for a segment, which could affect index size and computational time.

The differences between file- and block-level deduplication go beyond just how they operate. There are advantages and disadvantages to each approach.

File-level approaches can be less efficient than block-based deduplication:

  • A change within the file causes the whole file to be saved again. A file, such as a PowerPoint presentation, can have something as simple as the title page changed to reflect a new presenter or date -- this will cause the entire file to be saved a second time. Block-based deduplication would only save the changed blocks between one version of the file and the next.
  • Reduction ratios may only be in the 5:1 or less range whereas block-based deduplication has been shown to reduce capacity in the 20:1 to 50:1 range for stored data.
File-level approaches can be more efficient than block-based data deduplication:

  • Indexes for file-level deduplication are significantly smaller, which takes less computational time when duplicates are being determined. Backup performance is, therefore, less affected by the deduplication process.
  • File-level processes require less processing power due to the smaller index and reduced number of comparisons. Therefore, the impact on the systems performing the inspection is less. The impact on recovery time is low. Block-based deduplication will require "reassembly" of the chunks based on the master index that maps the unique segments and pointers to unique segments. Since file-based approaches store unique files and pointers to existing unique files there is less to reassemble.

About this author: Lauren Whitehouse is an analyst with Enterprise Strategy Group and covers data protection technologies. Lauren is a 20-plus-year veteran in the software industry, formerly serving in marketing and software development roles.

Do you have comments on this tip? Let us know.

Please let others know how useful this tip was via the rating scale below. Do you know a helpful backup tip, timesaver or workaround? Email the editors if you'd like to write tips for SearchDataBackup.com.

Rate this Tip
To rate tips, you must be a member of SearchDataBackup.com.
Register now to start rating these tips. Log in if you are already a member.




DISCLAIMER: Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.



Enterprise Backup Solutions - Continuous Data Protection (CDP)
About Us  |  Contact Us  |  For Advertisers  |  For Business Partners  |  Site Index  |  RSS
SEARCH 
TechTarget provides technology professionals with the information they need to perform their jobs - from developing strategy, to making cost-effective purchase decisions and managing their organizations' technology projects - with its network of technology-specific websites, events and online magazines.

TechTarget Corporate Web Site  |  Media Kits  |  Site Map




All Rights Reserved, Copyright 2008 - 2009, TechTarget | Read our Privacy Policy
  TechTarget - The IT Media ROI Experts