Complete guide to backup deduplication
The best way to select the right deduplication offering is to gauge each product being considered against organizational savings expected with higher data reduction, backup/restore performance requirements and the storage capacity consumed by each solution.
Effective data deduplication offers substantial data storage efficiencies that can help reduce data center sprawl. The savings aren't limited to storage infrastructure costs; they also extend to power, cooling and data center footprint costs. Furthermore, the smaller storage footprint reduces operational expenditure in the form of ongoing maintenance, upgrades and management costs.
Vendor data reduction claims can vary widely, with some claiming up to 300:1 dedupe ratios. But let's dive a bit deeper into what a 300:1 deduplication ratio entails, as it's theoretically possible but not very probable. If one conducted full-volume backups every single day for a year, and had a common effective change rate of 0.5% per week, the ratio would mathematically be approximately 300:1. But, of course, full-volume backups every day for a year aren't a realistic scenario. Dedupe ratios on incremental backups, which are the more realistic scenario, aren't very high. As with automotive efficiency, individual mileage may vary.
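The arithmetic behind that figure can be sketched in a few lines of Python. This is a simplified model, not a vendor formula: it assumes every changed block is unique, that changes never overlap from week to week, and that compression is ignored. The function name and parameters are hypothetical and chosen only for illustration.

```python
def dedupe_ratio(num_full_backups: int, weekly_change_rate: float, weeks: float) -> float:
    """Return logical data written divided by unique data stored.

    Both quantities are expressed in units of one full volume.
    Assumption: all changed data is unique and non-overlapping.
    """
    logical = num_full_backups * 1.0             # every full backup writes one whole volume
    stored = 1.0 + weekly_change_rate * weeks    # one baseline copy plus accumulated changes
    return logical / stored


# Daily full backups for a year at a 0.5% weekly change rate
daily = dedupe_ratio(num_full_backups=365, weekly_change_rate=0.005, weeks=52)
print(f"Daily fulls:  {daily:.0f}:1")    # 365 / 1.26, about 290:1

# Same change rate, but only one full backup per week
weekly = dedupe_ratio(num_full_backups=52, weekly_change_rate=0.005, weeks=52)
print(f"Weekly fulls: {weekly:.0f}:1")   # 52 / 1.26, about 41:1
```

Under these assumptions the headline ratio is driven almost entirely by how many redundant full copies are written, which is why it falls sharply as soon as the backup schedule becomes more realistic.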
Data reduction in real-world environments typically yields a 50% to 66% reduction in raw data. Factors that contribute to higher deduplication ratios are data type, daily/weekly data change rate, the number of data streams and where the deduplication takes place. For example, higher deduplication ratios can be achieved on common file system data and databases than on already-compressed data (audio/video/image files), which should be backed up directly without wasting compute horsepower on deduplication analysis. Every organization's data mix is unique; hence, a proof-of-concept test or pilot should be conducted with a short list of vendor solutions.
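Because vendors quote ratios while the real-world figures above are percentages, it can help to convert between the two. The following is a minimal sketch of that conversion; the helper names are hypothetical.

```python
def reduction_percent(ratio: float) -> float:
    """Convert an N:1 reduction ratio into percent of raw data eliminated."""
    return (1.0 - 1.0 / ratio) * 100.0


def ratio_from_percent(percent: float) -> float:
    """Convert percent of raw data eliminated into an N:1 reduction ratio."""
    return 1.0 / (1.0 - percent / 100.0)


print(f"50% reduction is {ratio_from_percent(50):.1f}:1")              # 2.0:1
print(f"66% reduction is {ratio_from_percent(66):.1f}:1")              # 2.9:1
print(f"300:1 is a {reduction_percent(300):.2f}% reduction")           # 99.67%
```

The conversion also shows the diminishing returns of very high ratios: moving from 2:1 to 3:1 frees a further sixth of the raw capacity, while everything beyond 100:1 competes for the last 1%.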
Performance requirements for faster restores
Every organization has unique recovery point objective (RPO) and recovery time objective (RTO) requirements. A corporate disaster recovery (DR) plan must stipulate the order of data recovery and the maximum time allowed for the recovery to complete so that business operations can be resumed.
These stipulations will then lead to the calculation of how much data needs to be recovered in a worst-case scenario. In file-based data backup, the initial backup is a full backup and all subsequent backups are incremental changes only, but DR constitutes a full restore. The restore performance of deduplicated data is typically slower than the ingest performance because in most cases the data needs to be rehydrated to put it in a usable form. Hence, organizational performance requirements based on internal SLAs and desired RPOs/RTOs will stipulate the minimum sustained throughput needed to recover from a disaster in time. The industry best practice for faster restores is to keep disk-based backups online longer.
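As a rough illustration of that calculation, the sketch below derives the sustained restore throughput a solution would have to deliver for a given worst-case data set and RTO. The 50 TB and eight-hour figures are hypothetical, the function name is invented for this example, and the model ignores network limits, rehydration overhead and restore ordering.

```python
def required_restore_throughput_mb_s(data_tb: float, rto_hours: float) -> float:
    """Return the sustained throughput (MB/s) needed to restore data_tb within rto_hours.

    Uses decimal units: 1 TB = 1,000,000 MB.
    """
    data_mb = data_tb * 1_000_000
    window_seconds = rto_hours * 3600
    return data_mb / window_seconds


# Hypothetical worst case: a 50 TB full restore with an 8-hour RTO
needed = required_restore_throughput_mb_s(data_tb=50, rto_hours=8)
print(f"Required restore throughput: {needed:,.0f} MB/s")   # about 1,736 MB/s
```

A figure like this is best compared against a product's measured restore rate during a pilot rather than its advertised ingest rate, since rehydration usually makes restores the slower of the two.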
In a constrained economy, CIOs exert strict financial scrutiny that is sure to magnify any storage capacity inefficiencies. With tight IT budgets, storage capacity is at a premium. Without deduplication, data protection offerings are plagued by slow data backup and restore speeds as well as storage inefficiencies due to the proliferation of redundant data. Data deduplication sends only deduplicated and compressed data across the network, requiring a fraction of the bandwidth, time and cost. This significantly reduces hardware costs and improves storage efficiency, lowering Capex and Opex while raising the data protection bar.
Although deduplication can be performed by backup software or some primary storage arrays, most of those solutions don't support all the data protection protocols (e.g., network-attached storage-based CIFS and NFS, Symantec OST, NDMP and REST APIs), applications and data types. That is where dedicated deduplication appliances come in.
Dedicated deduplication appliances provide massive scalability, protecting petabytes of data in a single system with reduced management costs, which results in a lower total cost of ownership.
Finally, selecting a deduplication offering means determining the performance requirements needed to conduct daily and weekly backups, and to meet organizational SLAs and worst-case performance requirements.
About the author: Ashar Baig is the president, principal analyst and consultant at Analyst Connection, an analyst firm focused on storage, storage and server virtualization, and data protection, among other IT disciplines.