Data deduplication approaches in backup today

The data deduplication market has become a confusing landscape. We asked users about the deduplication products they've deployed, and the pros and cons of each.

Beth Pariseau, Senior News Writer

Data deduplication is the hottest topic for data storage pros in 2009. It's no longer a bleeding-edge technology, and the cost savings of data reduction are especially appealing as organizations cope with data growth and tightening budgets. In response, storage vendors are pumping out more data deduplication products and approaches to cut down on the size of data stores.

As a result, dedupe is following what has become a familiar pattern in IT: A new technology comes along to fill a long-standing need, but as it becomes widely implemented, questions arise over how best to use it and how it may affect the rest of the environment. Fortunately, there are enough storage pros with experience using deduplication to help wade through the pros and cons of all the approaches.

"A number of vendors have come to market with dedupe as a feature, with more to come," Enterprise Strategy Group analyst Lauren Whitehouse said. "There are new avenues for customers to explore, but there's also a new level of confusion."


Data deduplication approaches
Hardware-based deduplication products
Inline vs. post-process data deduplication
Spanning the globe: Global dedupe
Symantec OpenStorage API: A hybrid approach
Software-only dedupe approaches

Data deduplication approaches

Independent backup expert W. Curtis Preston said data deduplication approaches can be divided into two broad categories: those that ship with hardware, and those that are software only. Hardware products include IP-connected network-attached storage (NAS) boxes and Fibre Channel-connected virtual tape libraries (VTLs). In software, a rough distinction can be made between "source" products that handle dedupe processing at the server level, and "target" products that do the processing at the NAS or VTL disk.
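Whether it runs at the source or the target, every product Preston describes rests on the same core idea: split data into chunks, key each chunk by a strong hash, and store any given chunk only once. The sketch below illustrates that idea in miniature; the class and its fixed 4 KB chunking are invented for this example, not any vendor's actual design.

```python
import hashlib

CHUNK_SIZE = 4096  # real products vary chunk sizes, often content-defined


class DedupeStore:
    """Toy content-addressed store: each unique chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}   # hash -> chunk bytes, stored once
        self.recipes = {}  # backup name -> ordered list of chunk hashes

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # duplicate chunks cost nothing
            hashes.append(h)
        self.recipes[name] = hashes

    def read(self, name):
        # A backup is rebuilt by following its recipe of chunk hashes.
        return b"".join(self.chunks[h] for h in self.recipes[name])


store = DedupeStore()
payload = b"A" * 8192                 # two identical 4 KB chunks
store.write("monday-full", payload)
store.write("tuesday-full", payload)  # second full backup adds no new chunks
assert store.read("tuesday-full") == payload
print(len(store.chunks))              # -> 1 unique chunk for 16 KB of backups
```

The source-vs.-target distinction is then just a question of where this hashing runs: on the backup client, or on the NAS/VTL appliance receiving the stream.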

Preston's rule of thumb when evaluating products is that only a handful of use cases require a specialized approach. The rest, he said, comes down to personal preference: "Ninety percent of the world could probably use any one of [the products on the market] and get the job done."

Hardware-based approaches (see chart below) require no changes to the backup software already in use. The hardware can be optimized to boost performance, but dedupe calculations are done in software.

Hardware-based deduplication products

NAS/IP-based:
Data Domain DD Series
EMC Disk Library
ExaGrid EX Series
Hewlett-Packard (HP) Co. StorageWorks D2D Backup Systems
NEC Hydrastor
NetApp NearStore
Quantum Corp. DXi Series

Virtual tape library/Fibre Channel:
Data Domain DD Series
EMC Disk Library
FalconStor Software VTL with SIR
HP StorageWorks Virtual Library System (VLS)
IBM Corp. ProtecTier
NetApp NearStore
Quantum DXi Series
Sepaton Inc. DeltaStor

Generally, the difference between the IP-based NAS and VTL approaches is sharpest when it comes to performance scalability. MultiCare Health System, a group of hospitals and health clinics in Tacoma, Wash., chose Sepaton Inc.'s S2100-ES2 VTL for backing up Windows data after first trying Data Domain Inc., because the Sepaton system scaled better.

Eric Zuspan, senior system administrator of SAN/Unix at MultiCare, said in January that "performance was pretty limited" with the Data Domain DD460 and DD560 deduplicating disk arrays. A typical Windows backup that had taken 4.5 hours with Data Domain took an hour and 20 minutes with Sepaton. Zuspan said his company's Windows team still uses Data Domain, but may phase out its arrays over time.

Inline vs. post-process data deduplication

At high enough capacities, some inline dedupe vendors argue that post-process deduplication can exceed backup windows. Orange County Sheriff's Department backup and email administrator Douglas Blackburn, a Data Domain user, said he likes inline dedupe "because when you're done, you're done." The advantage of post-process deduplication, however, is that the CPU-intensive dedupe work can't create a bottleneck between the backup server and the secondary storage target.

Sepaton and FalconStor more recently began offering what they call concurrent processing. That approach still moves data to a disk staging area first, but doesn't wait for backups to finish before deduping.
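The timing trade-off can be sketched as follows. Inline dedupe hashes each chunk in the data path, so only unique chunks ever reach disk; post-process lands the raw stream in a staging area first and dedupes after the backup window closes. The function names and in-memory "disk" lists here are illustrative only, not any vendor's pipeline.

```python
import hashlib


def inline_backup(stream, index, disk):
    for chunk in stream:                 # dedupe happens in the data path
        h = hashlib.sha256(chunk).hexdigest()
        if h not in index:
            index[h] = True
            disk.append(chunk)           # only new chunks are ever written


def post_process_backup(stream, index, staging, disk):
    staging.extend(stream)               # backup completes at raw disk speed
    while staging:                       # dedupe runs later, off the backup path
        chunk = staging.pop(0)
        h = hashlib.sha256(chunk).hexdigest()
        if h not in index:
            index[h] = True
            disk.append(chunk)


stream = [b"x" * 1024] * 10              # ten identical chunks
idx1, disk1 = {}, []
inline_backup(stream, idx1, disk1)

idx2, staging, disk2 = {}, [], []
post_process_backup(stream, idx2, staging, disk2)

print(len(disk1), len(disk2))            # -> 1 1: same end state, different timing
```

Concurrent processing, as Sepaton and FalconStor describe it, splits the difference: data still lands in staging first, but the dedupe loop starts draining the staging area while backups are still streaming in.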

Spanning the globe: Global dedupe

As customers dedupe more data, it will be important for vendors to support the ability to dedupe data across multiple controllers. This is known as global deduplication. So far, only FalconStor, IBM, NEC and Sepaton have products that do this.

Without global dedupe, customers have to split backup streams among multiple boxes and balance the load between them on their own. Orange County is running up against a scalability issue with Data Domain -- it's about to add a second expansion shelf to its primary DD560, which is the limit for that box. If Orange County adds another box, that system won't be able to see the other Data Domain devices in the environment. Blackburn said he was looking to put in Data Domain's DD690 gateway so he could choose his own storage and scaling strategy on the back end.

"I'd like to use our own EMC storage," Blackburn said.

However, Data Domain's gateways do not support EMC storage -- this is one example of interoperability sticking points common to this market today. Data Domain's CEO Frank Slootman laid the blame for that squarely at the feet of EMC.

"I don't think EMC has put the customer first in terms of working with us," he said. An EMC spokesperson declined comment.

Some customers choose between hardware targets based on whether they integrate with tape. "Both [Data Domain and Quantum] were really aggressive in terms of pricing," said Ben Barnes, IT infrastructure manager for AIC Ltd., in Ontario, Canada. "But in the end, when we looked at what we were quoted by Quantum, it was all-inclusive -- systems, tape, licensing and support, all in one price. Data Domain was not a one-stop shop."

That's no surprise. Quantum's legacy is as a tape vendor, although its disk-based business is growing much faster now. Data Domain sprang up as an alternative to tape. Barnes deployed Quantum's DXi 5500 FC VTL with data deduplication approximately nine months ago.

Though many companies at the same capacity point as Barnes are working to get rid of tape altogether, Barnes said regulatory compliance means tape is here to stay. The company makes monthly full backups to tape using Symantec Corp. Backup Exec and a Quantum Scalar 50 tape library for archival purposes. This process takes about 24 hours, he said, but because operational restores are now done from disk, "there's no process impact -- we just leave it and let it run."

The FC VTL system was also installed with growth in mind. "If the business changed, we don't want to be in a situation to have to do something drastically different," he said.

Symantec OpenStorage API: a hybrid approach

Since the early days of disk-based backup, users have been concerned about how disk devices would work with software designed for tape-based backup. The big worry was about maintaining catalog consistency within the backup software when an integrated virtual tape library makes remote copies.

Symantec Corp.'s response to this was to introduce an option called the OpenStorage API (OST) in version 6.5 of its NetBackup software in 2007. OpenStorage API partners include VTL vendors EMC, FalconStor Software Inc., IBM/Diligent Technologies Corp., Quantum, Sepaton and Sun Microsystems Inc., but the most widely deployed OST partner is Data Domain.

Symantec and Data Domain claim the OST version of Data Domain's target-based data deduplication appliance runs twice as fast as the appliance on its own. Deepak Mohan, senior vice president in Symantec's information management group, said this is because OST gives each side of the backup software/dedupe appliance pairing visibility into the other.

"This means that the NetBackup data stream is routed to the appliance in an efficient manner, with optimized block sizes, catalog transfers and handshaking -- the recovery process is optimized also," Mohan said.

Rich VanLare, network administrator for Regency Centers, said the OST integration with NetBackup sold him on Data Domain. "I have about eight administrators to back up 140 or so servers, and I want to make sure the entire process is very simple," he said. "With OST, there's no VTL layer to deal with, and if you have a new server, you can add it to an existing policy, and you're done."

VanLare also said he was happy with the performance of OST. "What was sold to me was that Data Domain claimed to be able to handle 45 streams at once, but I'm pushing 90," he said. VanLare also said his dedupe ratio was better than advertised. "I was told we'd get between 30 percent and 60 percent reduction, but we're at 94.1 percent," he said.
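VanLare's numbers mix two ways of expressing the same result: vendors often quote an "X:1 dedupe ratio," while users often quote a percentage reduction. The two convert directly, as this small sketch shows; the function names are mine, not any vendor's terminology.

```python
def reduction_to_ratio(reduction_pct):
    """A reduction percentage implies ratio = original / stored."""
    return 1 / (1 - reduction_pct / 100)


def ratio_to_reduction(ratio):
    """An X:1 dedupe ratio implies (1 - 1/X) of the data was eliminated."""
    return (1 - 1 / ratio) * 100


print(round(reduction_to_ratio(94.1), 1))  # -> 16.9, i.e. roughly 17:1
print(round(ratio_to_reduction(10), 1))    # -> 90.0% reduction from a 10:1 ratio
print(round(reduction_to_ratio(60), 1))    # -> 2.5: the quoted 60% is only 2.5:1
```

So VanLare's 94.1% reduction is roughly a 17:1 ratio, far beyond the 2.5:1 that the top of his quoted 30% to 60% range would imply.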

There is talk in the industry that EMC is working on an answer to OST through its NetWorker backup software. A company spokesperson declined comment.

A drawback to the OST method is that it requires vendors to cooperate, and not every vendor is willing in a highly competitive environment. CommVault and Data Domain have many joint customers and once worked together closely, but their partnership dissolved after CommVault added its own subfile-level deduplication to its backup software.

VanLare said the NetBackup GUI sometimes counters the simplification gained by using OST.

"NetBackup has a lot of work to do," VanLare said. "For the premier product in this space, it has a lot of shortcomings with its interface. It's difficult to use -- it's like you can see all the guts behind it, like in The Matrix, where they're watching all kinds of numbers fly by, but there aren't many easy clickable buttons to work with."

Software-based data deduplication approaches (see chart below) generally offer customers more flexibility and can be used to spread deduplication to broader sections of the IT environment than hardware-based approaches.

Software-only deduplication approaches

Client-side software products:
Asigra Inc. Televaulting
EMC Corp. Avamar
Symantec Corp. PureDisk

"Target" or hybrid software products:
CommVault Simpana v. 8
IBM Corp. Tivoli Storage Manager (TSM) v. 6
Symantec Corp. PureDisk

Client-side software like Symantec's PureDisk is designed for remote offices or network-bound environments where users want to cut the amount of data they're sending over the wire.
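The reason client-side dedupe helps over a WAN is the order of operations: the remote-office agent sends cheap chunk hashes first, and the data center requests only the chunks it has never seen. The sketch below illustrates that exchange; the class, method names, and chunk layout are invented for this example.

```python
import hashlib


class DataCenter:
    """Toy target that tracks which chunks it already holds."""

    def __init__(self):
        self.chunks = {}

    def missing(self, hashes):           # tiny request: hashes only, no data
        return [h for h in hashes if h not in self.chunks]

    def store(self, chunk):
        self.chunks[hashlib.sha256(chunk).hexdigest()] = chunk


def remote_backup(chunks, dc):
    """Send only never-seen chunks over the WAN; return bytes transferred."""
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    wanted = set(dc.missing(hashes))
    sent = 0
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        if h in wanted:                  # each unique chunk crosses the wire once
            wanted.discard(h)
            dc.store(c)
            sent += len(c)
    return sent


dc = DataCenter()
monday = [b"os-files" * 512] * 20 + [b"new-doc" * 512]
first = remote_backup(monday, dc)       # only the two unique chunks are sent
second = remote_backup(monday, dc)      # unchanged data: nothing crosses the WAN
print(first, second)                    # -> 7680 0
```

A timed-out WAN transfer of a full backup, like the ones described below, shrinks to a transfer of hashes plus whatever actually changed.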

To back up its remote offices, the Iowa Dept. of Revenue used Symantec Backup Exec agents at seven field offices to send tape jobs over the wide-area network (WAN) to the main data center. Completing backups for 200 GB of data per week required 60 different tape jobs at the main data center, according to senior network engineer Mark Wise.

"The amount of data going over that network was just too much," Wise said. "One of the sites was taking 26 hours to do a full backup. When you're dealing with remote offices and remote networks, the network has more latency and interruptions to communication. The Backup Exec agents were timing out and failing."

In late November 2007, the department deployed PureDisk, bringing the backup window down from up to 26 hours to an average of 30 minutes.

CommVault, which added dedupe to its Simpana software suite, argues that embedding dedupe in the core backup application can lead to speedier recovery times and better catalog consistency, because the same catalog is used to track deduplicated data. Symantec addresses catalog consistency with the OST approach, but PureDisk uses a separate catalog from NetBackup's.

When dedupe is embedded in the backup application, it gives customers one throat to choke and lower costs than deploying a separate product. Furniture retailer Rooms to Go has used CommVault to back up approximately 25 TB of data at its central data center in Seffner, Fla., for five years. The retailer used CommVault's single-instance storage (SIS) for file-level dedupe prior to the release of Simpana 8.

"We were really anxious to at least get the file-level dedupe," said Jason Hall, director of IT systems for Rooms to Go.

Hall said he was willing to stick with SIS until CommVault added subfile dedupe to Simpana 8.

"If we were backing up 100 TB a night or something, we'd probably have been more eager," he said. "But our full backups are about 5 TB -- we weren't anxious about it."

Others will mix and match, adding software specifically for dedupe even while using another vendor's backup software. One storage architect at a large telecom, who asked not to be identified because he isn't authorized to endorse the products his firm uses, is dealing with petabytes of data. He added EMC Avamar for approximately 100 TB of file and virtual server backups while using NetBackup and replication to back up higher tiers of applications.

It's unusual for such a large shop to deploy Avamar, especially when it has thousands of hosts and performance-intensive applications. But "any backup process adds load to the CPU client," the architect said, and when it comes to file shares and multiple copies of operating systems on less mission-critical servers, "there's a huge opportunity for dedupe. We're into the seven figures in savings with Avamar compared to tape -- we've saved more than we've spent."

The telecom is in the process of upgrading to Avamar 4.1, which will support twice as much capacity per grid node as the previous version. "The size of the grid defines the dedupe domain, and we can't get any commonalities between the grids," he said. Boosting the capacity within the grids will alleviate that problem at least partially. But going forward, "that'll be our biggest challenge -- how to manage bigger buckets in the dedupe domain. It won't keep us from doing it, but it also won't have the same amount of value for us that it could."

One drawback to software-based approaches is that most require customers to replace their current backup software. And not all of the dedupe software apps support every backup product or storage target. For example, Avamar cannot use EMC's own disk libraries as a target. PureDisk is not integrated with Symantec's SMB backup product, Backup Exec, though Mohan said this is on the roadmap.

Software-based approaches require organizations to carefully evaluate tradeoffs when it comes to adding a process within an existing infrastructure, according to ESG analyst Whitehouse. Inline backup-server based approaches such as CommVault's Simpana or IBM's Tivoli Storage Manager (TSM) can avoid adding processing load to the client server while still reducing at least part of the network traffic. NetBackup PureDisk can also be deployed at the backup server or disk target level if the customer chooses.

"It's easy to go and shop for a hardware target," ESG's Whitehouse said. "Users need to more carefully consider how the various software-based inline dedupe methods distribute the CPU load among clients."
