Data deduplication promises to reduce the transfer and storage of redundant data, which optimizes network bandwidth and storage capacity. Storing data more efficiently on disk lets you retain data for longer periods or "recapture" data to protect more applications with disk-based backup, increasing the likelihood that data can be recovered rapidly. Transferring less data over the network also improves performance. Reducing the data transferred over a WAN connection may allow organizations to consolidate backup from remote locations or extend disaster recovery to data that wasn't previously protected. The bottom line is that data dedupe can save organizations time and money by enabling more data recovery from disk and by reducing the footprint, power and cooling requirements of secondary storage. It can also enhance data protection.
Read the fine print when selecting a data dedupe product
The first point of confusion lies in the many ways storage capacity can be optimized. Data dedupe technology is often a catch-all category for technologies that optimize capacity. Archiving, single-instance storage, incremental "forever" backup, delta differencing and compression are just a few technologies or methods employed in the data protection process to eliminate redundancy and reduce the amount of data transferred or stored. Unfortunately, firms have to wade through a lot of marketing hype to understand what's being offered by vendors who toss around these terms.
In data protection processes, dedupe is a feature available in backup applications and disk storage systems to reduce disk and bandwidth requirements. Data dedupe technology examines data to identify and eliminate redundancy. For example, data dedupe may create a unique fingerprint for a data object with a hash algorithm and check that fingerprint against a master index. Unique data is written to storage; when the data already exists, only a pointer to the previously written copy is stored.
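The fingerprint-and-index idea can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the DedupeStore class, its method names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


class DedupeStore:
    """Toy store that keeps one copy of each unique data object (hypothetical)."""

    def __init__(self):
        self.index = {}      # fingerprint -> stored data (the "master index")
        self.pointers = []   # one fingerprint per object written, in write order

    def write(self, data: bytes) -> bool:
        """Record an object; return True if it was new, False if a duplicate."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.index
        if is_new:
            self.index[fingerprint] = data   # unique data is written once
        self.pointers.append(fingerprint)    # every write keeps only a pointer
        return is_new

    def read(self, position: int) -> bytes:
        """Follow the pointer for the object written at the given position."""
        return self.index[self.pointers[position]]


store = DedupeStore()
results = [store.write(obj) for obj in (b"payroll", b"invoices", b"payroll")]
print(results)           # [True, True, False]
print(len(store.index))  # 2
```

The third write adds nothing to the index, yet reading it back through its pointer still returns the original data.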
Granularity and dedupe
Another issue is the level of granularity the dedupe solution offers. Dedupe can be performed at the file, block and byte levels. There are tradeoffs for each method, including computational time, accuracy, level of duplication detected, index size and, potentially, the scalability of the solution.
File-level dedupe (or single-instance storage) removes duplicated data at the file level by checking file attributes and eliminating redundant copies of files stored on backup media. This method delivers less capacity reduction than other methods, but it's simple and fast.
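A rough sketch of single-instance storage shows why file-level dedupe is fast but limited. Note the assumption: this example fingerprints whole file contents with SHA-256 rather than checking file attributes, and the file names and the file_level_dedupe function are hypothetical.

```python
import hashlib


def file_level_dedupe(files):
    """Keep one stored copy per distinct file content.

    files maps a file name to its bytes. Returns (stored, catalog), where
    stored maps a digest to content and catalog maps each name to a digest.
    """
    stored = {}
    catalog = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        stored.setdefault(digest, content)   # first copy wins, repeats are skipped
        catalog[name] = digest
    return stored, catalog


files = {
    "report.doc": b"A" * 1000,
    "copy_of_report.doc": b"A" * 1000,       # exact duplicate
    "report_v2.doc": b"A" * 999 + b"B",      # differs by a single byte
}
stored, catalog = file_level_dedupe(files)
print(len(stored))                            # 2
print(sum(len(c) for c in stored.values()))   # 2000
```

The exact copy is eliminated, but the one-byte revision is stored in full, which is the capacity tradeoff of working at file granularity.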
Deduplicating at the sub-file level (block level) carves the data into chunks. In general, the block or chunk is "fingerprinted" and its unique identifier is then compared to the index. With smaller block sizes, there are more chunks and, therefore, more index comparisons and a higher potential to locate and eliminate redundancy (and produce higher reduction ratios). One tradeoff is I/O stress, which can be greater with more comparisons; in addition, the size of the index will be larger with smaller chunks, which could result in decreased backup performance. Performance can also be impacted because the chunks have to be reassembled to recover the data.
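The chunk-size tradeoff can be illustrated with a small sketch. It assumes fixed-size chunks and SHA-256 fingerprints, which is only one of several possible chunking schemes; the fixed_block_dedupe function and the synthetic test data are invented for the example.

```python
import hashlib


def fixed_block_dedupe(streams, chunk_size):
    """Dedupe byte streams using fixed-size chunks.

    Returns (index_entries, stored_bytes) for the given chunk size.
    """
    index = {}
    for stream in streams:
        for offset in range(0, len(stream), chunk_size):
            chunk = stream[offset:offset + chunk_size]
            index.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return len(index), sum(len(c) for c in index.values())


# Synthetic 4,096-byte "file" with no internal repetition (128 distinct digests).
v1 = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(128))
# Second backup of the same file with one byte changed at offset 100.
v2 = v1[:100] + bytes([v1[100] ^ 0xFF]) + v1[101:]

print(fixed_block_dedupe([v1, v2], 1024))  # (5, 5120)
print(fixed_block_dedupe([v1, v2], 256))   # (17, 4352)
```

Of the 8,192 bytes backed up, the smaller chunk size stores less data (4,352 versus 5,120 bytes) but needs more than three times as many index entries (17 versus 5).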
Byte-level reduction is a byte-by-byte comparison of new files and previously stored files. While this method is the only one that guarantees full redundancy elimination, the performance penalty could be high. Some vendors have taken other approaches. A few concentrate on understanding the format of the backup stream and evaluating duplication with this "content-awareness."
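As a deliberately simplified illustration of comparing a new file with a stored one byte by byte, the sketch below keeps only the bytes that differ between the matching start and end of two versions. Real byte-level products are considerably more sophisticated; the byte_delta and rebuild functions are hypothetical.

```python
def byte_delta(old: bytes, new: bytes):
    """Compare two versions byte by byte from both ends.

    Returns (prefix_len, suffix_len, middle): the lengths of the shared
    start and end, plus the bytes of `new` that fall between them.
    """
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def rebuild(old: bytes, delta) -> bytes:
    """Reconstruct the new version from the old version plus the delta."""
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


old = b"quarterly report 2023 final"
new = b"quarterly report 2024 final"
delta = byte_delta(old, new)
print(delta)                       # (20, 6, b'4')
print(rebuild(old, delta) == new)  # True
```

Only one byte of new data has to be kept, but every byte of both versions had to be examined to find it, which hints at where the performance penalty comes from.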
Where and when to dedupe
The work of data dedupe can be performed at one or more places between the data source and the target storage destination. Dedupe occurring at the app or file-server level (before the backup data is transmitted across the network) is referred to as client-side deduplication (a must-have if bandwidth reduction is important). Alternatively, dedupe of the backup stream can happen at the backup server, which can be referred to as proxy deduplication, or on the target device, which is called target-based deduplication.
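To see why client-side dedupe reduces bandwidth, consider the sketch below, in which the client fingerprints its chunks and sends only those the server doesn't already hold. The BackupServer class, the client_backup function and the tiny 4-byte chunk size are assumptions for illustration; the small amount of traffic needed to exchange fingerprints is not counted.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, to keep the example readable


class BackupServer:
    """Hypothetical backup target that holds chunks keyed by fingerprint."""

    def __init__(self):
        self.chunks = {}

    def missing(self, fingerprints):
        """Return the fingerprints the server does not hold yet."""
        return {fp for fp in fingerprints if fp not in self.chunks}

    def upload(self, fingerprint, chunk):
        self.chunks[fingerprint] = chunk


def client_backup(data: bytes, server: BackupServer) -> int:
    """Back up data; return the number of chunk bytes sent over the network."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    by_fingerprint = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    sent = 0
    for fp in server.missing(by_fingerprint):
        server.upload(fp, by_fingerprint[fp])
        sent += len(by_fingerprint[fp])
    return sent


server = BackupServer()
first = client_backup(b"AAAABBBBCCCC", server)
second = client_backup(b"AAAABBBBDDDD", server)
print(first, second)  # 12 4
```

The second backup transmits only the one chunk that changed. With proxy or target-based dedupe, all 12 bytes would cross the network to the backup server or target before any redundancy was removed.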
Deduplication can be timed to occur before data is written to the disk target (inline processing) or after data is written to the disk target (post-processing).
Post-process deduplication will write the backup image to a disk cache before starting to dedupe. This lets the backup complete at full disk performance. Post-process dedupe requires disk cache capacity sized for the backup data that's not deduplicated plus the additional capacity to store deduped data. The size of the cache depends on whether the dedupe process waits for the entire backup job to complete before starting deduplication or if it starts to deduplicate data as it's written and, more importantly, when the deduplication process releases storage space.
Inline approaches inspect and dedupe data on the way to the disk target. Inline dedupe could negatively impact backup performance, particularly when the app uses a fingerprint database that grows over time. The degree of performance degradation depends on several factors, including the method of fingerprinting, granularity of dedupe, where the inline processing occurs, network performance, how the dedupe technology workload is distributed and more.
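The difference between the two timings can be sketched as follows. Both functions are hypothetical, and the post-process version assumes the worst case described above: dedupe starts only after the whole backup has landed, and the cache is released only once dedupe finishes.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def inline_backup(chunks):
    """Dedupe before writing to disk; return (final_bytes, peak_bytes)."""
    store = {}
    used = peak = 0
    for chunk in chunks:
        fp = fingerprint(chunk)   # the index lookup sits on the write path
        if fp not in store:
            store[fp] = chunk
            used += len(chunk)
            peak = max(peak, used)
    return used, peak


def post_process_backup(chunks):
    """Land the raw backup first, then dedupe; return (final_bytes, peak_bytes)."""
    cache = list(chunks)          # raw backup written with no lookups in the way
    cache_bytes = sum(len(c) for c in cache)
    store = {}
    for chunk in cache:           # dedupe pass runs after the backup completes
        store.setdefault(fingerprint(chunk), chunk)
    store_bytes = sum(len(c) for c in store.values())
    # Raw cache and deduped store coexist until the cache is released.
    return store_bytes, cache_bytes + store_bytes


backup = [b"A" * 100, b"B" * 100, b"A" * 100, b"A" * 100]
print(inline_backup(backup))        # (200, 200)
print(post_process_backup(backup))  # (200, 600)
```

Both approaches end up storing the same 200 bytes. Inline pays for that with a lookup on every write, while post-process pays with extra disk capacity during the backup window.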
Hardware versus software deduplication
Many of today's most popular hardware-based approaches may solve the immediate problem of reducing data in disk-to-disk backup environments, but they mask the issues that will arise as the environment expands and evolves.
The issue is software versus hardware. On the hardware side, purpose-built appliances offer faster deployments, integrating with existing backup software and providing a plug-and-play experience. The compromise? There are limitations when it comes to flexibility and scalability. Additional appliances may need to be added as demand for capacity increases, and the resulting appliance "sprawl" not only adds management complexity and overhead, but may limit deduplication to each individual appliance.
With software approaches, disk capacity may be more flexible. Disk storage is virtualized, appearing as a large pool that scales seamlessly. In a software scenario, management overhead is lower and the deduplication benefit may be greater because deduplication occurs across a larger data set than in most individual appliance architectures.
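The effect of deduplicating across one large pool rather than within separate appliances can be shown with a small count of unique chunks. The site names and chunk contents are invented for illustration, and the backup streams are assumed to be already split into chunks.

```python
import hashlib


def unique_chunks(chunks) -> int:
    """Count distinct chunks by fingerprint."""
    return len({hashlib.sha256(c).digest() for c in chunks})


# Hypothetical backup streams from two locations.
site_a = [b"os-image", b"mail-db", b"sales-data"]
site_b = [b"os-image", b"mail-db", b"hr-data"]

# Two appliances, each deduplicating only the data it receives.
per_appliance = unique_chunks(site_a) + unique_chunks(site_b)

# One shared pool deduplicating across both sites.
global_pool = unique_chunks(site_a + site_b)

print(per_appliance, global_pool)  # 6 4
```

Because each appliance sees only its own data, the chunks common to both sites are stored twice. The shared pool stores them once.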
Software-based client-side and proxy dedupe optimize performance by distributing dedupe processing across a large number of clients or media servers. Target dedupe requires powerful, purpose-built storage appliances as the entire backup load needs to be processed on the target. Because software implementations offer better workload distribution, inline dedupe performance may be improved over hardware-based equivalents.
Choosing a software or hardware approach may depend on your current backup software implementation. If the backup software in place doesn't have a dedupe feature or option, switching to one that does may pose challenges.
This article originally appeared in Storage magazine.
About this author: Lauren Whitehouse is an analyst with Enterprise Strategy Group and covers data protection technologies. Lauren is a 20-plus-year veteran in the software industry, formerly serving in marketing and software development roles.