Complete guide to backup deduplication
Deduplicating disk backup storage targets have very well-defined roles: to reduce the cost of backup storage and to increase the performance of data backups and restores. Faster backups allow organizations to meet data backup windows, which are becoming ever harder to meet amidst the tsunami of data that's growing at rates ranging from 40% to 60% each year in many companies. Faster backups also position IT organizations to meet their service-level agreements (SLAs) with their corporate stakeholders.
Deduplication shrinks the amount of data sent offsite, another key benefit that helps to control wide-area bandwidth costs and reduce the storage hardware required to store that data. But deduplication comes with its own costs: it consumes compute horsepower, server RAM and storage hardware IOPS, all of which add latency that can hurt restore performance. Anything that slows restores could hamper how well an organization can meet its recovery time objectives (RTOs) and recovery point objectives (RPOs), and can be the difference between maintaining normal operations following a data loss event and not being able to conduct business.
The traditional enterprise environment is a crazy quilt of disparate software applications and hardware solutions, each attempting to solve a piece of the data reduction puzzle. Because of the numerous benefits associated with data deduplication, it has achieved ubiquity in data protection environments. Over the past few years, the place where data deduplication is performed has shifted from the dedicated deduplication appliance to the storage array or the backup software.
How data deduplication technology works
Deduplication works by storing each unique data sequence only once. The initial data backup is a full backup. For all subsequent backups, the deduplication engine discards the duplicate data elements and sends only the changed data bits to the storage array, drastically reducing the amount of data transmitted and significantly increasing the speeds of data backups.
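The idea of storing each unique data sequence only once can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's engine: the `ChunkStore` class, the tiny 8-byte chunk size and the sample data are all assumptions chosen to make the behavior easy to trace.

```python
import hashlib

CHUNK_SIZE = 8  # illustrative only; real products use kilobyte-sized chunks


class ChunkStore:
    """Hypothetical content-addressed store that keeps one copy per unique chunk."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def backup(self, data):
        """Store data and return (recipe, bytes newly written to the store)."""
        recipe, new_bytes = [], 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:  # unseen chunk: write it once
                self.chunks[fp] = chunk
                new_bytes += len(chunk)
            recipe.append(fp)  # duplicates are recorded by reference only
        return recipe, new_bytes

    def restore(self, recipe):
        """Reassemble the original data from its list of fingerprints."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = ChunkStore()
first = b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD"   # initial full backup, 4 chunks
second = b"AAAAAAAABBBBBBBBXXXXXXXXDDDDDDDD"  # next backup, one chunk changed

recipe1, written1 = store.backup(first)
recipe2, written2 = store.backup(second)
print(written1, written2)  # 32 8
```

The first backup writes all 32 bytes, while the second writes only the 8-byte chunk that changed. Both backups can still be restored in full from their recipes.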
If data needs to be replicated between sites, only the incremental changes are replicated to the deduplicated storage at the secondary or tertiary site. Challenges arise when different vendor technologies are used for data backup, deduplication and the storage array. In that case, the data needs to be thawed out or "rehydrated" so it can be read and sent offsite in full. Furthermore, rehydration imposes an I/O penalty on many deduplication technologies, which causes them to restore at much slower rates than their published ingest rates. This rehydration penalty can significantly increase RTOs and adversely impact business continuity and SLAs.
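The difference between deduplication-aware replication and sending rehydrated data can be illustrated with a small sketch. The function names and the dictionary-based "sites" below are hypothetical stand-ins for real storage systems, and the 8-byte blocks are sample data.

```python
import hashlib


def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()


def replicate(recipe, source_chunks, target_chunks):
    """Copy only the chunks the target site lacks; return bytes sent."""
    sent = 0
    for fp in recipe:
        if fp not in target_chunks:
            target_chunks[fp] = source_chunks[fp]
            sent += len(source_chunks[fp])
    return sent


def rehydrate(recipe, chunks):
    """Rebuild the full, undeduplicated data from its recipe."""
    return b"".join(chunks[fp] for fp in recipe)


# A backup made of four blocks, one of which repeats.
blocks = [b"block-A!", b"block-B!", b"block-C!", b"block-A!"]
recipe = [fingerprint(b) for b in blocks]
source = {fingerprint(b): b for b in blocks}

# The secondary site already holds two of the three unique blocks.
target = {fingerprint(b): b for b in (b"block-A!", b"block-B!")}

dedup_aware_bytes = replicate(recipe, source, target)
rehydrated_bytes = len(rehydrate(recipe, source))
print(dedup_aware_bytes, rehydrated_bytes)  # 8 32
```

When both sites understand the same fingerprints, only the missing 8-byte block crosses the wire. When they do not, the full 32 bytes must be rebuilt and sent.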
It's important to understand a few key features of deduplication technology when evaluating backup products.
Higher data change rates can significantly reduce the performance of a deduplication engine because, as the change rate increases, the hash table can outgrow system memory, especially in environments with large amounts of data. Furthermore, two simultaneous CPU-intensive operations -- such as the deduplication database being updated and the data deduplication process starting -- can severely degrade the performance of the deduplication engine.
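A rough back-of-the-envelope calculation shows why the hash table can outgrow memory. The figures below are assumptions for illustration only: 40 bytes per index entry (a 32-byte SHA-256 fingerprint plus an 8-byte location), 100 TiB of unique data, and daily change rates of 2% and 5%. Real engines use different entry layouts and often keep only part of the index in RAM.

```python
KIB, GIB, TIB = 1024, 1024 ** 3, 1024 ** 4
ENTRY_BYTES = 40  # assumption: 32-byte fingerprint + 8-byte chunk location


def index_size_bytes(unique_data_bytes, avg_chunk_bytes, entry_bytes=ENTRY_BYTES):
    """Size of a fingerprint index with one entry per unique chunk."""
    return (unique_data_bytes // avg_chunk_bytes) * entry_bytes


def daily_index_growth(protected_bytes, change_pct, avg_chunk_bytes):
    """Index growth per day if change_pct percent of the data is new each day."""
    new_unique = protected_bytes * change_pct // 100
    return index_size_bytes(new_unique, avg_chunk_bytes)


print(index_size_bytes(100 * TIB, 8 * KIB) // GIB)       # 500
print(index_size_bytes(100 * TIB, 32 * KIB) // GIB)      # 125
print(daily_index_growth(100 * TIB, 2, 8 * KIB) // GIB)  # 10
print(daily_index_growth(100 * TIB, 5, 8 * KIB) // GIB)  # 25
```

Under these assumptions, an 8 KiB chunk size needs about 500 GiB of index for 100 TiB of unique data, and every extra percentage point of daily change adds roughly 5 GiB of index per day.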
Multistreaming involves sending data to a backup appliance on multiple streams to parallelize deduplication so it can be performed concurrently, thus increasing I/O bandwidth and avoiding bottlenecks.
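As a simplified sketch of multistreaming, the example below fingerprints several backup streams concurrently with a thread pool. The stream contents, the 4 KiB chunk size and the worker count are assumptions, and a real appliance would parallelize far more than the hashing step.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # illustrative chunk size


def fingerprint_stream(data):
    """Split one backup stream into chunks and fingerprint each chunk."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]


# Four hypothetical backup streams, each 16 KiB of sample data.
streams = [bytes([n]) * (CHUNK_SIZE * 4) for n in range(4)]

# One worker per stream, so the streams are processed side by side.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(fingerprint_stream, streams))

# Processing the streams one after another gives the same fingerprints.
sequential = [fingerprint_stream(s) for s in streams]
print(parallel == sequential)  # True
```

The parallel run yields the same fingerprints as the sequential one, so the gain is in throughput rather than in what gets deduplicated.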
Since most applications in the data center are single-streamed, most legacy data protection software solutions in the market today are designed to utilize the inline hashing mechanism for deduplication. The software only dedupes data across a single data stream, which results in lower dedupe ratios and significantly increases the cost of data protection.
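The cost of deduplicating within a single stream only can be shown with a toy comparison. The three "servers" below are hypothetical: each stream carries the same eight operating system chunks plus one chunk of its own, and `stored_bytes` is an illustrative helper rather than any product's logic.

```python
import hashlib


def stored_bytes(streams, share_index):
    """Bytes written when the fingerprint index is shared or kept per stream."""
    shared = set()
    total = 0
    for stream in streams:
        seen = shared if share_index else set()  # per-stream index starts empty
        for chunk in stream:
            fp = hashlib.sha256(chunk).digest()
            if fp not in seen:
                seen.add(fp)
                total += len(chunk)
    return total


os_image = [b"os-chunk-%d" % i for i in range(8)]                  # common to all servers
streams = [os_image + [b"app-data-%d" % n] for n in range(3)]      # one stream per server

per_stream = stored_bytes(streams, share_index=False)
global_index = stored_bytes(streams, share_index=True)
print(per_stream, global_index)  # 270 110
```

With a per-stream index, the shared operating system chunks are stored three times (270 bytes in total). With a global index they are stored once, and the total drops to 110 bytes.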
Variable or fixed block sizes
There are two data parsing approaches to deduplication: fixed and variable. A fixed deduplication algorithm "chunks" data into blocks of a fixed size, which you may be able to choose but which is typically 8 KB. The block size determines the size of the smallest identifiable difference in the data. Variable chunking groups the data into chunks based on patterns in the data itself. If a subsequent backup adds new information to the file or backup stream, there's a shift in the data pattern. The new information is written to disk and all other information is resynchronized and deduplicated accordingly. This allows the deduplication engine to focus on large areas of similar data and zoom in to sub-8 KB areas within the database.
Variable chunking is the more common scheme and is more effective in recognizing duplicate data. On average, variable chunking drives deduplication ratios of 20 times or higher. The data is chunked into a sub-8 KB block size (the industry average is from 8 KB to 32 KB). This increases deduplication ratios through the ability to "match" smaller data more easily and quickly.
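The resynchronization behavior of variable chunking can be demonstrated with a toy comparison. The sketch below is not a production algorithm: the rolling sum over a 16-byte window is a simple stand-in for the rolling hashes real engines use, and the window size, divisor, 32-byte fixed block size and sample data are all assumptions.

```python
import hashlib


def fixed_chunks(data, size=32):
    """Cut the data at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, divisor=32):
    """Cut wherever the sum of the last `window` bytes hits a chosen remainder."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        if rolling % divisor == divisor - 1:  # boundary depends on content only
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reused(old_chunks, new_chunks):
    """Count chunks in the new backup that already exist in the old one."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(1 for c in new_chunks if hashlib.sha256(c).digest() in known)


# 4 KiB of deterministic pseudo-random sample data.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
edited = b"NEW!" + original  # four bytes inserted at the front

fixed_reused = reused(fixed_chunks(original), fixed_chunks(edited))
variable_reused = reused(variable_chunks(original), variable_chunks(edited))
print("fixed:", fixed_reused, "variable:", variable_reused)
```

Inserting four bytes at the front shifts every fixed block, so almost nothing matches the earlier backup. The variable chunker's boundaries follow the content, so they fall back into step shortly after the insertion and most chunks are recognized as duplicates.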
Symantec OST plug-in
Data deduplication can be performed by the dedicated deduplication appliance, the storage array or the backup software. To complete the backup/deduplication catalog and to properly manage the devices within the data protection system, the backup software needs to be aware of the functions performed by other components in the backup system. To address those issues, Symantec developed the OpenStorage Technology (OST) plug-in for NetBackup and Backup Exec media servers. OST dramatically increases the performance and reliability of those environments without the need to reconfigure or make radical changes to a data protection system's policies or procedures.
Storage vendors that participate in the Symantec Technology Enabled Program are given access to the OST Software Developers Kit, which allows them to create plug-ins that more tightly integrate their devices with NetBackup and Backup Exec.
The OST API is protocol independent, meaning hardware providers can use OST over whatever protocols are best suited for their devices, including Fibre Channel, TCP/IP or SCSI.
The OST API provides many features, but hardware vendors aren't required to support all of them and can select the ones most suited for their devices. These features include:
- Direct copy to dedicated tape drives
- Direct copy to shared tape drives
- Optimized deduplication
- Optimized synthetics
- Concurrent read/write operations on disk that improve utilization and speed
- WAN-optimized image replication to disaster recovery sites
- Power management
About the author: Ashar Baig is the president, principal analyst and consultant at Analyst Connection, an analyst firm focused on storage, storage and server virtualization, and data protection, among other IT disciplines.