The target deduplication storage appliance is a multibillion-dollar market. There are a lot of big names in this market, including EMC Data Domain, Symantec NetBackup, HP StoreOnce, IBM ProtecTIER and ExaGrid.
How could such a big market be in decline? This question comes from simple observation.
Target deduplication storage appliances first came out in the early 2000s. They were designed to solve a backup-to-disk problem resulting from exponential data growth's pressure on tape backup windows. Backups were extending outside designated time windows or not completing. Changing storage targets to file storage or disk that looked like tape (e.g., virtual tape) usually solved the performance issue but created a cost problem.
Backup software was architected for tape, which creates multiple data copies by design through a rotation scheme known as grandfather-father-son: the "father" is a full or complete backup taken once a week, the "sons" are incremental backups taken daily for the other six days, and the "grandfather" is the older full backup that is usually sent offsite. That traditional methodology creates lots of copies of the data, costing little when utilizing inexpensive tape and quite a bit more when utilizing more costly disk.
Few backup software products had built-in deduplication in the early part of the 21st century. Target deduplication storage appliances were born to solve the issue of storing costly multiple copies of backup data on disk, and they made disk backup cost-competitive with tape. The key to their success was that they required zero changes to users' existing backup software.
The original backup problem that target deduplication storage appliances aimed to solve is rapidly disappearing because the vast majority of backup software products now have built-in deduplication. A disappearing problem usually means a disappearing market.
But it gets worse for target deduplication storage appliances.
Many backup software products provide source-based deduplication that dedupes backup data before it leaves the source (server being backed up). Source deduplication has the additional advantage of accelerating backups by moving less data from each protected server. Less data being backed up decreases the pressure on ever-problematic backup windows. This is why many target deduplication storage appliances are now integrating with source deduplication backup software.
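The idea behind source deduplication can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-256 fingerprints; real products typically use much larger or variable-size chunks and their own protocols, and every name here (`source_dedupe`, `backup`, `target_store`) is hypothetical rather than taken from any vendor's software.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the demo is easy to follow


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def source_dedupe(data: bytes, known_hashes: set[str]) -> tuple[list[str], dict[str, bytes]]:
    """Return the ordered list of chunk hashes (the recipe) and only the
    chunks that the target does not already hold."""
    recipe: list[str] = []
    to_send: dict[str, bytes] = {}
    for piece in chunk(data):
        digest = hashlib.sha256(piece).hexdigest()
        recipe.append(digest)
        if digest not in known_hashes and digest not in to_send:
            to_send[digest] = piece
    return recipe, to_send


# Hypothetical stand-in for the backup target: chunk hash -> chunk bytes.
target_store: dict[str, bytes] = {}


def backup(data: bytes) -> tuple[list[str], int]:
    """Deduplicate at the source, then 'send' only the new chunks.
    Returns the recipe and the number of payload bytes sent."""
    recipe, to_send = source_dedupe(data, set(target_store))
    target_store.update(to_send)
    return recipe, sum(len(c) for c in to_send.values())


monday = b"AAAABBBBCCCCDDDD"
tuesday = b"AAAABBBBCCCCEEEE"  # only the last chunk changed

recipe_mon, sent_mon = backup(monday)
recipe_tue, sent_tue = backup(tuesday)

# The second backup can still be rebuilt in full from the target's chunks.
restored = b"".join(target_store[h] for h in recipe_tue)

print(f"Monday sent {sent_mon} bytes, Tuesday sent {sent_tue} bytes")
```

In this toy run the second backup moves only the one changed chunk across the wire, which is the effect that relieves pressure on the backup window.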
Target deduplication storage appliances are also seeing competition from primary storage systems now offering deduplication as a no-cost feature. Deduplication in primary storage is aimed at primary data and virtualization, but it can also be used as target deduplication storage in lieu of a separate target deduplication storage system.
Now that the backup software copy problem has been mostly resolved, what happens to the target deduplication storage appliance market? This question becomes more urgent when considering that target deduplication storage appliances carry a premium over faster general-purpose storage at the same capacities.
Several of the leading vendors of target deduplication storage appliances seem to realize this. They are reacting to market changes by doing one, some or all of the following:
- Emphasizing what they claim are target storage deduplication advantages over backup software built-in deduplication, even when these are not real advantages and are often marketing exaggerations;
- Repositioning the target storage deduplication appliances for archive as well as backup;
- Bundling the target deduplication storage appliance with backup software.
In the first case, they emphasize deduplication ratios, equivalent storage capacities, equivalent backup performance and replication efficiencies. These appear to be real advantages on the surface. However, even a cursory examination shows that these claims have little merit.
Let's start with deduplication ratios. Ratios are a comparison. The question is: of what?
There are some products that claim deduplication ratios in the hundreds-to-one range. That sounds incredibly impressive. And while many of the backup software deduplication ratio claims are considerably less impressive, there are a few that claim awe-inspiring ratios. Does this mean that target deduplication storage appliances generally do a better job of deduplication than backup software? No, because the products' deduplication ratios are not comparable. They are not measuring the same things. Ratios are marketecture, not architecture, and mean very little.
For ratios to be an effective comparison, every product would have to start with the same backup data. When every backup software product backs up data differently, the ratio becomes an apples-to-oranges comparison. There is no industry or even de facto standard deduplication ratio specification.
For example, if the backup software does a full-volume backup only once, then does block-level incrementals forever after, deduplicates and finally compresses, the ratios will look pretty pedestrian. Compare that to backup software or a target deduplication storage appliance that does a full-volume backup every time, then deduplicates before writing to the target storage. The ratios are dramatically better, but the amount of data actually stored may be greater.
The most common objection to the "ratio" argument is to propose comparing the target deduplication storage appliance directly with backup software that has deduplication built in. Sometimes that is possible; most of the time it is not. A direct comparison requires that the backup software be able to turn off its deduplication. Only a few can.
The better calculation for comparison purposes is to determine how much storage is actually consumed by the backups over a period of time, not the deduplication ratio. The ratios are marketing manipulations, whereas actual storage consumption is a hard measurement that is easy to determine.
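A back-of-the-envelope model makes the point concrete. Every figure below is an illustrative assumption, not a measurement: a 10,000 GB volume, 100 GB of changed blocks per day, 30 days of backups, 2:1 compression, and a hypothetical `CHANGE_AMPLIFICATION` factor standing in for the extra data a full-backup deduper may store as new when its chunk boundaries do not line up with the blocks that changed.

```python
FULL_GB = 10_000            # size of the protected volume (assumed)
DAILY_CHANGE_GB = 100       # changed blocks per day (assumed)
DAYS = 30                   # backups retained (assumed)
COMPRESSION = 2.0           # compression applied to what is stored (assumed)
CHANGE_AMPLIFICATION = 1.5  # hypothetical: extra data stored per change by the full-backup deduper


def incremental_forever() -> tuple[float, float]:
    """One full backup, then block-level incrementals; returns (ingested, stored) in GB."""
    ingested = FULL_GB + (DAYS - 1) * DAILY_CHANGE_GB
    stored = ingested / COMPRESSION
    return ingested, stored


def full_every_time() -> tuple[float, float]:
    """A full backup every day, deduplicated at the target; returns (ingested, stored) in GB."""
    ingested = DAYS * FULL_GB
    unique = FULL_GB + (DAYS - 1) * DAILY_CHANGE_GB * CHANGE_AMPLIFICATION
    stored = unique / COMPRESSION
    return ingested, stored


def reported_ratio(ingested: float, stored: float) -> float:
    """The headline ratio: data ingested divided by data stored."""
    return ingested / stored


a_ingested, a_stored = incremental_forever()
b_ingested, b_stored = full_every_time()
ratio_a = reported_ratio(a_ingested, a_stored)
ratio_b = reported_ratio(b_ingested, b_stored)

print(f"incremental forever: ratio {ratio_a:.1f}:1, stored {a_stored:,.0f} GB")
print(f"full every time:     ratio {ratio_b:.1f}:1, stored {b_stored:,.0f} GB")
```

Under these assumptions the full-every-time approach reports a ratio roughly 20 times higher while consuming more disk, because the ratio rewards ingesting redundant data in the first place.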
Equivalent storage capacity -- or the amount of storage necessary without dedupe -- is another marketing assertion based on deduplication ratios. Vendors multiply usable capacity by the deduplication ratio to calculate equivalent storage capacity. Utilizing dubious deduplication ratios makes equivalent storage just as dubious. It is also based on a false assumption that users have only two choices: either deduplicate with that specific product or store full-volume backups every single day.
Target deduplication storage appliance vendors also like to promote equivalent backup performance. Equivalent backup performance is a dubious claim similar to equivalent storage, with a twist. It can and does get complicated by source deduplication software such as EMC's Data Domain Boost or Avamar software, Symantec's OST and HP's Catalyst. These software clients are utilized in combination with a target deduplication storage appliance. The source deduplication software typically runs on backup media servers and on application servers, such as those running relational databases. That software works in cooperation with the target deduplication storage appliance by deduplicating the data before it leaves the source, reducing the amount of data that is actually streamed into the target dedupe storage appliance. Then the dedupe target performs global deduplication.
Equivalent backup performance, as measured in TB/hour, is determined by multiplying the amount of data stored per hour by the deduplication ratio. It is not a raw throughput number. It is a calculated number based on a non-comparable number.
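The arithmetic behind equivalent storage capacity is simple enough to sketch. The 100 TB usable capacity and the two ratios below are made-up figures chosen only to show how the same box yields very different "equivalent" numbers depending on the ratio plugged in.

```python
def equivalent_capacity_tb(usable_tb: float, dedupe_ratio: float) -> float:
    """Equivalent capacity as described above: usable capacity times the dedupe ratio."""
    return usable_tb * dedupe_ratio


USABLE_TB = 100  # physical usable capacity of one hypothetical appliance

# Ratios a vendor might quote depending on the backup method assumed (illustrative).
claimed_ratios = {
    "full backup every day": 20,
    "incremental forever": 2,
}

equivalents = {
    workload: equivalent_capacity_tb(USABLE_TB, ratio)
    for workload, ratio in claimed_ratios.items()
}

for workload, tb in equivalents.items():
    print(f"{workload}: {tb:,.0f} TB equivalent on {USABLE_TB} TB of disk")
```

The physical capacity never changes; only the assumption about what users would otherwise have stored does.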
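Working that calculation backward shows why the headline figure says little about raw speed. The two appliances and all of their numbers below are hypothetical, picked so that the one with the larger equivalent claim turns out to have the lower underlying rate.

```python
def equivalent_tb_per_hour(stored_tb_per_hour: float, dedupe_ratio: float) -> float:
    """Equivalent performance: data stored per hour times the dedupe ratio."""
    return stored_tb_per_hour * dedupe_ratio


def stored_tb_per_hour(equivalent: float, dedupe_ratio: float) -> float:
    """Invert the claim to recover the underlying stored-data rate."""
    return equivalent / dedupe_ratio


# Hypothetical datasheet claims: (equivalent TB/hour, ratio assumed by the vendor).
appliance_a = (40.0, 20.0)
appliance_b = (30.0, 10.0)

rate_a = stored_tb_per_hour(*appliance_a)
rate_b = stored_tb_per_hour(*appliance_b)

print(f"A claims {appliance_a[0]} TB/hour equivalent, underlying rate {rate_a} TB/hour")
print(f"B claims {appliance_b[0]} TB/hour equivalent, underlying rate {rate_b} TB/hour")
```

Appliance A wins on the datasheet, yet appliance B is the one actually writing more data per hour in this example.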
Raw data throughput is the only performance metric that is relatively standardized and generally compared.
Replication without rehydration is another target deduplication storage appliance vendor emphasis. But this capability is not unique to these appliances. Any storage system holding deduplicated data, whether that data was deduplicated by backup software or by the storage system itself, can replicate it without rehydration.
Repositioning target deduplication storage appliances for both backup and archive software helps differentiate and expand the potential total available market. It seems to make sense, since archiving is also growing exponentially and feels somewhat like backup.
But archive requirements and characteristics are very different from backup requirements and characteristics. Archives rarely have anywhere near the duplicate data that backups have. In addition, archives must scale from dozens of petabytes to exabytes and support fast searches and queries, and the data residing in them must be highly resilient, durable, persistent and immutable for tens to hundreds of years. Most target deduplication storage appliances can't meet all of these requirements.
Finally, bundling combines backup software, media server or servers, and target deduplication storage appliances in a single converged package. These complete bundled backup appliances are growing much faster than traditional target deduplication storage appliances per IDC's latest published numbers.
Target deduplication storage appliances are losing their relevancy in the data center. Vendor attempts to stop the decline can slow it to a point but cannot stop it. The backup problem target deduplication storage appliances set out to fix is fast becoming a memory.
About the author:
Marc Staimer is founder and senior analyst at Dragon Slayer Consulting in Beaverton, Ore. The 15-year-old consulting practice focuses on the areas of strategic planning, product development, and market development. Marc can be reached at firstname.lastname@example.org.